lab_2_tokenize_by_bpe package

Submodules

Lab 2.

BPE and machine translation evaluation

lab_2_tokenize_by_bpe.main.calculate_bleu(actual: str | None, reference: str, max_order: int = 3) → float | None

Compare two sequences using the BLEU metric.

Parameters:
  • actual (str) – Predicted sequence

  • reference (str) – Expected sequence

  • max_order (int) – Max length of n-gram to consider for comparison

Returns:

A value of BLEU metric

Return type:

float

In case of corrupt input arguments, or if any function used inside returns None, None is returned
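The lab's own implementation is not reproduced here, but the computation can be sketched as the geometric mean of n-gram precisions up to max_order. This is a minimal illustration under two assumptions: n-grams are character-level, and the result is scaled to a percentage.

```python
from math import exp, log

def calculate_bleu(actual, reference, max_order=3):
    """Sketch of BLEU: geometric mean of n-gram precisions (assumptions noted above)."""
    if not isinstance(actual, str) or not isinstance(reference, str):
        return None
    precisions = []
    for order in range(1, max_order + 1):
        # Character-level n-grams (an assumption; the lab may use another unit)
        actual_ngrams = {tuple(actual[i:i + order]) for i in range(len(actual) - order + 1)}
        reference_ngrams = {tuple(reference[i:i + order]) for i in range(len(reference) - order + 1)}
        if not actual_ngrams:
            return None
        precisions.append(len(actual_ngrams & reference_ngrams) / len(actual_ngrams))
    if any(precision == 0 for precision in precisions):
        return 0.0
    return exp(sum(log(precision) for precision in precisions) / max_order) * 100
```

Identical sequences yield 100.0; fully disjoint ones yield 0.0.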

lab_2_tokenize_by_bpe.main.calculate_precision(actual: list[tuple[str, ...]], reference: list[tuple[str, ...]]) → float | None

Compare two sequences using the Precision metric.

Parameters:
  • actual (list[tuple[str, ...]]) – Predicted sequence of n-grams

  • reference (list[tuple[str, ...]]) – Expected sequence of n-grams

Returns:

Value of Precision metric

Return type:

float

In case of corrupt input arguments, None is returned
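One common definition of this metric is the share of unique n-grams from the predicted sequence that also appear in the reference. A minimal sketch, assuming that definition:

```python
def calculate_precision(actual, reference):
    """Fraction of unique predicted n-grams found in the reference (assumption)."""
    if not isinstance(actual, list) or not isinstance(reference, list):
        return None
    unique_actual = set(actual)
    if not unique_actual:
        return None
    matched = len(unique_actual & set(reference))
    return matched / len(unique_actual)
```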

lab_2_tokenize_by_bpe.main.collect_frequencies(text: str, start_of_word: str | None, end_of_word: str) → dict[tuple[str, ...], int] | None

Count the number of occurrences of each word.

Parameters:
  • text (str) – Original text with no preprocessing

  • start_of_word (str) – A token that signifies the start of word

  • end_of_word (str) – A token that signifies the end of word

Returns:

Dictionary in the form of

<preprocessed word: number of occurrences>

Return type:

dict[tuple[str, …], int]

In case of corrupt input arguments, or if any function used inside returns None, None is returned

lab_2_tokenize_by_bpe.main.collect_ngrams(text: str, order: int) → list[tuple[str, ...]] | None

Extract n-grams from the given sequence.

Parameters:
  • text (str) – Original text

  • order (int) – Required number of elements in a single n-gram

Returns:

A sequence of n-grams

Return type:

list[tuple[str, …]]

In case of corrupt input arguments, None is returned
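A sliding-window sketch, assuming character-level n-grams (the unit of the n-gram is not specified in this reference):

```python
def collect_ngrams(text, order):
    """Collect all contiguous character n-grams of the given order."""
    if not isinstance(text, str) or not isinstance(order, int) or order <= 0:
        return None
    return [tuple(text[i:i + order]) for i in range(len(text) - order + 1)]
```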

lab_2_tokenize_by_bpe.main.count_tokens_pairs(word_frequencies: dict[tuple[str, ...], int]) → dict[tuple[str, str], int] | None

Count number of occurrences of each pair of subsequent tokens.

Parameters:

word_frequencies (dict[tuple[str, ...], int]) – A dictionary in the form of <preprocessed word: number of occurrences>

Returns:

A dictionary in the form of

<token pair: number of occurrences>

Return type:

dict[tuple[str, str], int]

In case of corrupt input arguments, None is returned
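This is the standard pair-counting step of BPE: every adjacent token pair inside a word contributes that word's frequency. A minimal sketch:

```python
def count_tokens_pairs(word_frequencies):
    """Count adjacent token pairs, weighted by word frequency."""
    if not isinstance(word_frequencies, dict):
        return None
    pair_counts = {}
    for word, frequency in word_frequencies.items():
        for left, right in zip(word, word[1:]):
            pair_counts[(left, right)] = pair_counts.get((left, right), 0) + frequency
    return pair_counts
```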

lab_2_tokenize_by_bpe.main.decode(encoded_text: list[int] | None, vocabulary: dict[str, int] | None, end_of_word_token: str | None) → str | None

Translate an encoded sequence into a decoded one.

Parameters:
  • encoded_text (list[int]) – A sequence of token identifiers

  • vocabulary (dict[str, int]) – A dictionary in the form of <token: identifier>

  • end_of_word_token (str) – An end-of-word token

Returns:

Decoded sequence

Return type:

str

In case of corrupt input arguments, None is returned
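A sketch of decoding under one plausible convention: invert the vocabulary, concatenate the tokens, and replace each end-of-word token with a space (the exact replacement rule is an assumption):

```python
def decode(encoded_text, vocabulary, end_of_word_token):
    """Map identifiers back to tokens and join them into text."""
    if encoded_text is None or vocabulary is None:
        return None
    inverse = {identifier: token for token, identifier in vocabulary.items()}
    decoded = "".join(inverse[identifier] for identifier in encoded_text)
    if end_of_word_token is not None:
        decoded = decoded.replace(end_of_word_token, " ")
    return decoded
```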

lab_2_tokenize_by_bpe.main.encode(original_text: str, vocabulary: dict[str, int] | None, start_of_word_token: str | None, end_of_word_token: str | None, unknown_token: str) → list[int] | None

Translate a decoded sequence into an encoded one.

Parameters:
  • original_text (str) – Original text

  • vocabulary (dict[str, int]) – A dictionary in the form of <token: identifier>

  • start_of_word_token (str) – A start-of-word token

  • end_of_word_token (str) – An end-of-word token

  • unknown_token (str) – A token that signifies unknown sequence

Returns:

A list of token identifiers

Return type:

list[int]

In case of corrupt input arguments, or if any function used inside returns None, None is returned

lab_2_tokenize_by_bpe.main.geo_mean(precisions: list[float], max_order: int) → float | None

Compute the geometric mean of a sequence of values.

Parameters:
  • precisions (list[float]) – A sequence of Precision values

  • max_order (int) – Maximum length of n-gram considered

Returns:

A value of geometric mean of Precision metric

Return type:

float

In case of corrupt input arguments, None is returned
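The geometric mean is conveniently computed in log space to avoid underflow on small precisions. A sketch, assuming max_order is the divisor and that any non-positive precision collapses the mean to zero:

```python
from math import exp, log

def geo_mean(precisions, max_order):
    """Geometric mean of precisions, computed via logarithms."""
    if not isinstance(precisions, list) or not isinstance(max_order, int) or max_order <= 0:
        return None
    if any(precision <= 0 for precision in precisions):
        return 0.0
    return exp(sum(log(precision) for precision in precisions) / max_order)
```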

lab_2_tokenize_by_bpe.main.get_vocabulary(word_frequencies: dict[tuple[str, ...], int], unknown_token: str) → dict[str, int] | None

Establish a correspondence between tokens and their integer identifiers.

Parameters:
  • word_frequencies (dict[tuple[str, ...], int]) – A dictionary in the form of <preprocessed word: number of occurrences>

  • unknown_token (str) – A token to signify an unknown token

Returns:

A dictionary in the form of <token: identifier>

Return type:

dict[str, int]

In case of corrupt input arguments, None is returned
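A sketch of one plausible scheme: gather every token (and every character inside merged tokens) plus the unknown token, then assign sequential identifiers in a deterministic order. The ordering rule here, longest token first and then alphabetical, is an assumption:

```python
def get_vocabulary(word_frequencies, unknown_token):
    """Assign sequential identifiers to all tokens, their characters, and <unk>."""
    if not isinstance(word_frequencies, dict) or not isinstance(unknown_token, str):
        return None
    tokens = {unknown_token}
    for word in word_frequencies:
        for token in word:
            tokens.add(token)
            tokens.update(token)  # individual characters of merged tokens
    ordered = sorted(tokens, key=lambda token: (-len(token), token))
    return {token: identifier for identifier, token in enumerate(ordered)}
```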

lab_2_tokenize_by_bpe.main.load_vocabulary(vocab_path: str) → dict[str, int] | None

Read and return a dictionary of type <token: identifier>.

Parameters:

vocab_path (str) – A path to the saved vocabulary

Returns:

A dictionary in the form of <token: identifier>

Return type:

dict[str, int]

In case of corrupt input arguments, None is returned
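The on-disk format is not specified in this reference; assuming the vocabulary is saved as a JSON object, loading is a thin wrapper over the json module:

```python
import json

def load_vocabulary(vocab_path):
    """Load a <token: identifier> mapping from a JSON file (format is an assumption)."""
    if not isinstance(vocab_path, str):
        return None
    with open(vocab_path, encoding="utf-8") as vocab_file:
        vocabulary = json.load(vocab_file)
    if not isinstance(vocabulary, dict):
        return None
    return vocabulary
```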

lab_2_tokenize_by_bpe.main.merge_tokens(word_frequencies: dict[tuple[str, ...], int], pair: tuple[str, str]) → dict[tuple[str, ...], int] | None

Update the word frequency dictionary by replacing a pair of tokens with a merged one.

Parameters:
  • word_frequencies (dict[tuple[str, ...], int]) – A dictionary in the form of <preprocessed word: number of occurrences>

  • pair (tuple[str, str]) – A pair of tokens to be merged

Returns:

A dictionary in the form of

<preprocessed word: number of occurrences>

Return type:

dict[tuple[str, …], int]

In case of corrupt input arguments, None is returned
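This is the merge step of BPE: wherever the chosen pair occurs adjacently inside a word, the two tokens are concatenated into one. A minimal sketch:

```python
def merge_tokens(word_frequencies, pair):
    """Replace every adjacent occurrence of pair with the concatenated token."""
    if not isinstance(word_frequencies, dict) or not isinstance(pair, tuple):
        return None
    merged = {}
    for word, frequency in word_frequencies.items():
        tokens = []
        index = 0
        while index < len(word):
            if index < len(word) - 1 and (word[index], word[index + 1]) == pair:
                tokens.append(word[index] + word[index + 1])
                index += 2
            else:
                tokens.append(word[index])
                index += 1
        merged[tuple(tokens)] = frequency
    return merged
```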

lab_2_tokenize_by_bpe.main.prepare_word(raw_word: str, start_of_word: str | None, end_of_word: str | None) → tuple[str, ...] | None

Tokenize a word into unigrams and append the end-of-word token.

Parameters:
  • raw_word (str) – Original word

  • start_of_word (str) – A token that signifies the start of word

  • end_of_word (str) – A token that signifies the end of word

Returns:

Preprocessed word

Return type:

tuple[str, …]

In case of corrupt input arguments, None is returned
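A minimal sketch: split the word into characters and attach whichever boundary tokens are provided.

```python
def prepare_word(raw_word, start_of_word, end_of_word):
    """Split a word into unigrams and add optional boundary tokens."""
    if not isinstance(raw_word, str):
        return None
    tokens = list(raw_word)
    if start_of_word is not None:
        tokens.insert(0, start_of_word)
    if end_of_word is not None:
        tokens.append(end_of_word)
    return tuple(tokens)
```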

lab_2_tokenize_by_bpe.main.tokenize_word(word: tuple[str, ...], vocabulary: dict[str, int], end_of_word: str | None, unknown_token: str) → list[int] | None

Split a word into tokens.

Parameters:
  • word (tuple[str, ...]) – Preprocessed word

  • vocabulary (dict[str, int]) – A dictionary in the form of <token: identifier>

  • end_of_word (str) – An end-of-word token

  • unknown_token (str) – A token that signifies unknown sequence

Returns:

A list of token identifiers

Return type:

list[int]

In case of corrupt input arguments, None is returned
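A sketch under a greedy longest-match assumption: at each position, take the longest substring present in the vocabulary, falling back to the unknown token for a single unmatched character:

```python
def tokenize_word(word, vocabulary, end_of_word, unknown_token):
    """Greedy longest-match tokenization of a prepared word (assumption)."""
    if not isinstance(word, tuple) or not isinstance(vocabulary, dict):
        return None
    text = "".join(word)
    identifiers = []
    position = 0
    while position < len(text):
        for end in range(len(text), position, -1):
            if text[position:end] in vocabulary:
                identifiers.append(vocabulary[text[position:end]])
                position = end
                break
        else:
            # No match of any length: emit the unknown token and advance one character
            identifiers.append(vocabulary[unknown_token])
            position += 1
    return identifiers
```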

lab_2_tokenize_by_bpe.main.train(word_frequencies: dict[tuple[str, ...], int] | None, num_merges: int) → dict[tuple[str, ...], int] | None

Create the required number of new tokens by merging existing ones.

Parameters:
  • word_frequencies (dict[tuple[str, ...], int]) – A dictionary in the form of <preprocessed word: number of occurrences>

  • num_merges (int) – Required number of new tokens

Returns:

A dictionary in the form of

<preprocessed word: number of occurrences>

Return type:

dict[tuple[str, …], int]

In case of corrupt input arguments, or if any function used inside returns None, None is returned