lab_2_tokenize_by_bpe package

Submodules

Lab 2.

BPE and machine translation evaluation

lab_2_tokenize_by_bpe.main.calculate_bleu(actual: str | None, reference: str, max_order: int = 3) → float | None

Compare two sequences using the BLEU metric.

Parameters:
  • actual (str) – Predicted sequence

  • reference (str) – Expected sequence

  • max_order (int) – Max length of n-gram to consider for comparison

Returns:

A value of BLEU metric

Return type:

float

In case of corrupt input arguments, or if any function used inside returns None, None is returned
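The lab's own implementation is not reproduced here, but the computation can be sketched as the geometric mean of n-gram precisions up to max_order. This is a minimal illustration under two assumptions: n-grams are character-level, and the result is scaled to a percentage.

```python
from math import exp, log

def calculate_bleu(actual, reference, max_order=3):
    """Sketch of BLEU: geometric mean of n-gram precisions (assumptions noted above)."""
    if not isinstance(actual, str) or not isinstance(reference, str):
        return None
    precisions = []
    for order in range(1, max_order + 1):
        # Character-level n-grams (an assumption; the lab may use another unit)
        actual_ngrams = {tuple(actual[i:i + order]) for i in range(len(actual) - order + 1)}
        reference_ngrams = {tuple(reference[i:i + order]) for i in range(len(reference) - order + 1)}
        if not actual_ngrams:
            return None
        precisions.append(len(actual_ngrams & reference_ngrams) / len(actual_ngrams))
    if any(precision == 0 for precision in precisions):
        return 0.0
    return exp(sum(log(precision) for precision in precisions) / max_order) * 100
```

Identical sequences yield 100.0; fully disjoint ones yield 0.0.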

lab_2_tokenize_by_bpe.main.calculate_precision(actual: list[tuple[str, ...]], reference: list[tuple[str, ...]]) → float | None

Compare two sequences using the Precision metric.

Parameters:
  • actual (list[tuple[str, ...]]) – Predicted sequence of n-grams

  • reference (list[tuple[str, ...]]) – Expected sequence of n-grams

Returns:

Value of Precision metric

Return type:

float

In case of corrupt input arguments, None is returned
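One common definition of this metric is the share of unique n-grams from the predicted sequence that also appear in the reference. A minimal sketch, assuming that definition:

```python
def calculate_precision(actual, reference):
    """Fraction of unique predicted n-grams found in the reference (assumption)."""
    if not isinstance(actual, list) or not isinstance(reference, list):
        return None
    unique_actual = set(actual)
    if not unique_actual:
        return None
    matched = len(unique_actual & set(reference))
    return matched / len(unique_actual)
```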

lab_2_tokenize_by_bpe.main.collect_frequencies(text: str, start_of_word: str | None, end_of_word: str) → dict[tuple[str, ...], int] | None

Count the number of occurrences of each word.

Parameters:
  • text (str) – Original text with no preprocessing

  • start_of_word (str) – A token that signifies the start of word

  • end_of_word (str) – A token that signifies the end of word

Returns:

Dictionary in the form of

<preprocessed word: number of occurrences>

Return type:

dict[tuple[str, …], int]

In case of corrupt input arguments, or if any function used inside returns None, None is returned

lab_2_tokenize_by_bpe.main.collect_ngrams(text: str, order: int) → list[tuple[str, ...]] | None

Extract n-grams from the given sequence.

Parameters:
  • text (str) – Original text

  • order (int) – Required number of elements in a single n-gram

Returns:

A sequence of n-grams

Return type:

list[tuple[str, …]]

In case of corrupt input arguments, None is returned
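A sliding-window sketch, assuming character-level n-grams (the unit of the n-gram is not specified in this reference):

```python
def collect_ngrams(text, order):
    """Collect all contiguous character n-grams of the given order."""
    if not isinstance(text, str) or not isinstance(order, int) or order <= 0:
        return None
    return [tuple(text[i:i + order]) for i in range(len(text) - order + 1)]
```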

lab_2_tokenize_by_bpe.main.count_tokens_pairs(word_frequencies: dict[tuple[str, ...], int]) → dict[tuple[str, str], int] | None

Count number of occurrences of each pair of subsequent tokens.

Parameters:

word_frequencies (dict[tuple[str, ...], int]) – A dictionary in the form of <preprocessed word: number of occurrences>

Returns:

A dictionary in the form of

<token pair: number of occurrences>

Return type:

dict[tuple[str, str], int]

In case of corrupt input arguments, None is returned
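This is the standard pair-counting step of BPE: every adjacent token pair inside a word contributes that word's frequency. A minimal sketch:

```python
def count_tokens_pairs(word_frequencies):
    """Count adjacent token pairs, weighted by word frequency."""
    if not isinstance(word_frequencies, dict):
        return None
    pair_counts = {}
    for word, frequency in word_frequencies.items():
        for left, right in zip(word, word[1:]):
            pair_counts[(left, right)] = pair_counts.get((left, right), 0) + frequency
    return pair_counts
```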

lab_2_tokenize_by_bpe.main.decode(encoded_text: list[int] | None, vocabulary: dict[str, int] | None, end_of_word_token: str | None) → str | None

Translate an encoded sequence into a decoded one.

Parameters:
  • encoded_text (list[int]) – A sequence of token identifiers

  • vocabulary (dict[str, int]) – A dictionary in the form of <token: identifier>

  • end_of_word_token (str) – An end-of-word token

Returns:

Decoded sequence

Return type:

str

In case of corrupt input arguments, None is returned
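A sketch of decoding under one plausible convention: invert the vocabulary, concatenate the tokens, and replace each end-of-word token with a space (the exact replacement rule is an assumption):

```python
def decode(encoded_text, vocabulary, end_of_word_token):
    """Map identifiers back to tokens and join them into text."""
    if encoded_text is None or vocabulary is None:
        return None
    inverse = {identifier: token for token, identifier in vocabulary.items()}
    decoded = "".join(inverse[identifier] for identifier in encoded_text)
    if end_of_word_token is not None:
        decoded = decoded.replace(end_of_word_token, " ")
    return decoded
```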

lab_2_tokenize_by_bpe.main.encode(original_text: str, vocabulary: dict[str, int] | None, start_of_word_token: str | None, end_of_word_token: str | None, unknown_token: str) → list[int] | None

Translate a decoded sequence into an encoded one.

Parameters:
  • original_text (str) – Original text

  • vocabulary (dict[str, int]) – A dictionary in the form of <token: identifier>

  • start_of_word_token (str) – A start-of-word token

  • end_of_word_token (str) – An end-of-word token

  • unknown_token (str) – A token that signifies unknown sequence

Returns:

A list of token identifiers

Return type:

list[int]

In case of corrupt input arguments, or if any function used inside returns None, None is returned

lab_2_tokenize_by_bpe.main.geo_mean(precisions: list[float], max_order: int) → float | None

Compute the geometric mean of a sequence of values.

Parameters:
  • precisions (list[float]) – A sequence of Precision values

  • max_order (int) – Maximum length of n-gram considered

Returns:

A value of geometric mean of Precision metric

Return type:

float

In case of corrupt input arguments, None is returned
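The geometric mean is conveniently computed in log space to avoid underflow on small precisions. A sketch, assuming max_order is the divisor and that any non-positive precision collapses the mean to zero:

```python
from math import exp, log

def geo_mean(precisions, max_order):
    """Geometric mean of precisions, computed via logarithms."""
    if not isinstance(precisions, list) or not isinstance(max_order, int) or max_order <= 0:
        return None
    if any(precision <= 0 for precision in precisions):
        return 0.0
    return exp(sum(log(precision) for precision in precisions) / max_order)
```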

lab_2_tokenize_by_bpe.main.get_vocabulary(word_frequencies: dict[tuple[str, ...], int], unknown_token: str) → dict[str, int] | None

Establish a correspondence between tokens and their integer identifiers.

Parameters:
  • word_frequencies (dict[tuple[str, ...], int]) – A dictionary in the form of <preprocessed word: number of occurrences>

  • unknown_token (str) – A token to signify an unknown token

Returns:

A dictionary in the form of <token: identifier>

Return type:

dict[str, int]

In case of corrupt input arguments, None is returned
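A sketch of one plausible scheme: gather every token (and every character inside merged tokens) plus the unknown token, then assign sequential identifiers in a deterministic order. The ordering rule here, longest token first and then alphabetical, is an assumption:

```python
def get_vocabulary(word_frequencies, unknown_token):
    """Assign sequential identifiers to all tokens, their characters, and <unk>."""
    if not isinstance(word_frequencies, dict) or not isinstance(unknown_token, str):
        return None
    tokens = {unknown_token}
    for word in word_frequencies:
        for token in word:
            tokens.add(token)
            tokens.update(token)  # individual characters of merged tokens
    ordered = sorted(tokens, key=lambda token: (-len(token), token))
    return {token: identifier for identifier, token in enumerate(ordered)}
```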

lab_2_tokenize_by_bpe.main.load_vocabulary(vocab_path: str) → dict[str, int] | None

Read and return a dictionary of type <token: identifier>.

Parameters:

vocab_path (str) – A path to the saved vocabulary

Returns:

A dictionary in the form of <token: identifier>

Return type:

dict[str, int]

In case of corrupt input arguments, None is returned
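The on-disk format is not specified in this reference; assuming the vocabulary is saved as a JSON object, loading is a thin wrapper over the json module:

```python
import json

def load_vocabulary(vocab_path):
    """Load a <token: identifier> mapping from a JSON file (format is an assumption)."""
    if not isinstance(vocab_path, str):
        return None
    with open(vocab_path, encoding="utf-8") as vocab_file:
        vocabulary = json.load(vocab_file)
    if not isinstance(vocabulary, dict):
        return None
    return vocabulary
```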

lab_2_tokenize_by_bpe.main.merge_tokens(word_frequencies: dict[tuple[str, ...], int], pair: tuple[str, str]) → dict[tuple[str, ...], int] | None

Update the word frequency dictionary by replacing a pair of tokens with a merged one.

Parameters:
  • word_frequencies (dict[tuple[str, ...], int]) – A dictionary in the form of <preprocessed word: number of occurrences>

  • pair (tuple[str, str]) – A pair of tokens to be merged

Returns:

A dictionary in the form of

<preprocessed word: number of occurrences>

Return type:

dict[tuple[str, …], int]

In case of corrupt input arguments, None is returned
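This is the merge step of BPE: wherever the chosen pair occurs adjacently inside a word, the two tokens are concatenated into one. A minimal sketch:

```python
def merge_tokens(word_frequencies, pair):
    """Replace every adjacent occurrence of pair with the concatenated token."""
    if not isinstance(word_frequencies, dict) or not isinstance(pair, tuple):
        return None
    merged = {}
    for word, frequency in word_frequencies.items():
        tokens = []
        index = 0
        while index < len(word):
            if index < len(word) - 1 and (word[index], word[index + 1]) == pair:
                tokens.append(word[index] + word[index + 1])
                index += 2
            else:
                tokens.append(word[index])
                index += 1
        merged[tuple(tokens)] = frequency
    return merged
```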

lab_2_tokenize_by_bpe.main.prepare_word(raw_word: str, start_of_word: str | None, end_of_word: str | None) → tuple[str, ...] | None

Tokenize a word into unigrams and append the end-of-word token.

Parameters:
  • raw_word (str) – Original word

  • start_of_word (str) – A token that signifies the start of word

  • end_of_word (str) – A token that signifies the end of word

Returns:

Preprocessed word

Return type:

tuple[str, …]

In case of corrupt input arguments, None is returned
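A minimal sketch: split the word into characters and attach whichever boundary tokens are provided.

```python
def prepare_word(raw_word, start_of_word, end_of_word):
    """Split a word into unigrams and add optional boundary tokens."""
    if not isinstance(raw_word, str):
        return None
    tokens = list(raw_word)
    if start_of_word is not None:
        tokens.insert(0, start_of_word)
    if end_of_word is not None:
        tokens.append(end_of_word)
    return tuple(tokens)
```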

lab_2_tokenize_by_bpe.main.tokenize_word(word: tuple[str, ...], vocabulary: dict[str, int], end_of_word: str | None, unknown_token: str) → list[int] | None

Split a word into tokens.

Parameters:
  • word (tuple[str, ...]) – Preprocessed word

  • vocabulary (dict[str, int]) – A dictionary in the form of <token: identifier>

  • end_of_word (str) – An end-of-word token

  • unknown_token (str) – A token that signifies unknown sequence

Returns:

A list of token identifiers

Return type:

list[int]

In case of corrupt input arguments, None is returned
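A sketch under a greedy longest-match assumption: at each position, take the longest substring present in the vocabulary, falling back to the unknown token for a single unmatched character:

```python
def tokenize_word(word, vocabulary, end_of_word, unknown_token):
    """Greedy longest-match tokenization of a prepared word (assumption)."""
    if not isinstance(word, tuple) or not isinstance(vocabulary, dict):
        return None
    text = "".join(word)
    identifiers = []
    position = 0
    while position < len(text):
        for end in range(len(text), position, -1):
            if text[position:end] in vocabulary:
                identifiers.append(vocabulary[text[position:end]])
                position = end
                break
        else:
            # No match of any length: emit the unknown token and advance one character
            identifiers.append(vocabulary[unknown_token])
            position += 1
    return identifiers
```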

lab_2_tokenize_by_bpe.main.train(word_frequencies: dict[tuple[str, ...], int] | None, num_merges: int) → dict[tuple[str, ...], int] | None

Create the required number of new tokens by merging existing ones.

Parameters:
  • word_frequencies (dict[tuple[str, ...], int]) – A dictionary in the form of <preprocessed word: number of occurrences>

  • num_merges (int) – Required number of new tokens

Returns:

A dictionary in the form of

<preprocessed word: number of occurrences>

Return type:

dict[tuple[str, …], int]

In case of corrupt input arguments, or if any function used inside returns None, None is returned