lab_2_tokenize_by_bpe package
Submodules
Lab 2. BPE and machine translation evaluation
- lab_2_tokenize_by_bpe.main.calculate_bleu(actual: str | None, reference: str, max_order: int = 3) float | None
Compare two sequences using the BLEU metric.
- Parameters:
- Returns:
A value of BLEU metric
- Return type:
In case of corrupt input arguments, or if any of the functions used returns None, None is returned
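Below is a minimal sketch of the computation this description implies: n-gram precisions for orders 1 through max_order combined with a geometric mean. The name bleu_sketch, the whitespace tokenization and the omission of a brevity penalty are illustrative assumptions, not the module's actual implementation (which also returns None on corrupt input and may scale the score).

    from collections import Counter

    def bleu_sketch(actual: str, reference: str, max_order: int = 3) -> float:
        """Illustrative BLEU-style score: geometric mean of n-gram precisions."""
        actual_tokens = actual.split()
        reference_tokens = reference.split()
        precisions = []
        for order in range(1, max_order + 1):
            actual_ngrams = Counter(zip(*[actual_tokens[i:] for i in range(order)]))
            reference_ngrams = Counter(zip(*[reference_tokens[i:] for i in range(order)]))
            overlap = sum((actual_ngrams & reference_ngrams).values())  # clipped matches
            precisions.append(overlap / max(sum(actual_ngrams.values()), 1))
        score = 1.0
        for precision in precisions:
            score *= precision
        return score ** (1.0 / max_order)  # geometric mean over all orders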
- lab_2_tokenize_by_bpe.main.calculate_precision(actual: list[tuple[str, ...]], reference: list[tuple[str, ...]]) float | None
Compare two sequences using the Precision metric.
- Parameters:
- Returns:
Value of Precision metric
- Return type:
In case of corrupt input arguments, None is returned
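A minimal sketch of one common Precision definition over n-gram lists: the share of distinct candidate n-grams that also occur in the reference. The name precision_sketch and the use of unique (rather than clipped) counts are assumptions about the module's exact counting.

    def precision_sketch(actual: list[tuple[str, ...]],
                         reference: list[tuple[str, ...]]) -> float | None:
        """Illustrative precision: matched unique n-grams over all unique n-grams."""
        if not actual:
            return None
        unique_actual = set(actual)
        matched = unique_actual & set(reference)
        return len(matched) / len(unique_actual)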
- lab_2_tokenize_by_bpe.main.collect_frequencies(text: str, start_of_word: str | None, end_of_word: str) dict[tuple[str, ...], int] | None
Count the number of occurrences of each word.
- Parameters:
- Returns:
A dictionary in the form of <preprocessed word: number of occurrences>
- Return type:
In case of corrupt input arguments, or if any of the functions used returns None, None is returned
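A minimal sketch, assuming words are split on whitespace, decomposed into single characters, framed with the optional start-of-word and the end-of-word markers, and then counted. The name collect_frequencies_sketch is illustrative, and the real function's None handling for corrupt input is omitted.

    def collect_frequencies_sketch(text: str, start_of_word: str | None,
                                   end_of_word: str) -> dict[tuple[str, ...], int]:
        """Illustrative word counting over preprocessed word tuples."""
        frequencies: dict[tuple[str, ...], int] = {}
        for raw_word in text.split():
            tokens = list(raw_word)  # unigrams (single characters)
            if start_of_word is not None:
                tokens.insert(0, start_of_word)
            tokens.append(end_of_word)
            key = tuple(tokens)
            frequencies[key] = frequencies.get(key, 0) + 1
        return frequencies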
- lab_2_tokenize_by_bpe.main.collect_ngrams(text: str, order: int) list[tuple[str, ...]] | None
Extract n-grams from the given sequence.
- Parameters:
- Returns:
A sequence of n-grams
- Return type:
In case of corrupt input arguments, None is returned
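A minimal sketch of sliding-window n-gram extraction; treating the sequence as individual characters is an assumption, since the page does not say which unit the module uses.

    def collect_ngrams_sketch(text: str, order: int) -> list[tuple[str, ...]]:
        """Illustrative sliding-window n-gram extraction over characters."""
        return [tuple(text[i:i + order]) for i in range(len(text) - order + 1)]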
- lab_2_tokenize_by_bpe.main.count_tokens_pairs(word_frequencies: dict[tuple[str, ...], int]) dict[tuple[str, str], int] | None
Count the number of occurrences of each pair of adjacent tokens.
- Parameters:
word_frequencies (dict[tuple[str, ...], int]) – A dictionary in the form of <preprocessed word: number of occurrences>
- Returns:
A dictionary in the form of <token pair: number of occurrences>
- Return type:
In case of corrupt input arguments, None is returned
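A minimal sketch: every pair of adjacent tokens inside a preprocessed word is counted, weighted by how often that word occurs. count_tokens_pairs_sketch is an illustrative stand-in, not the module's code.

    def count_tokens_pairs_sketch(
            word_frequencies: dict[tuple[str, ...], int]) -> dict[tuple[str, str], int]:
        """Illustrative counting of adjacent token pairs, weighted by word frequency."""
        pair_counts: dict[tuple[str, str], int] = {}
        for word, count in word_frequencies.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] = pair_counts.get(pair, 0) + count
        return pair_counts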
- lab_2_tokenize_by_bpe.main.decode(encoded_text: list[int] | None, vocabulary: dict[str, int] | None, end_of_word_token: str | None) str | None
Translate an encoded sequence into a decoded one.
- Parameters:
- Returns:
Decoded sequence
- Return type:
In case of corrupt input arguments, None is returned
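A minimal sketch of decoding: invert the <token: identifier> vocabulary, map each identifier back to its token, and turn end-of-word markers into spaces. How the real function treats unknown identifiers or corrupt input is not shown here.

    def decode_sketch(encoded_text: list[int], vocabulary: dict[str, int],
                      end_of_word_token: str | None) -> str:
        """Illustrative decoding via an inverted vocabulary."""
        id_to_token = {identifier: token for token, identifier in vocabulary.items()}
        decoded = ''.join(id_to_token[identifier] for identifier in encoded_text)
        if end_of_word_token is not None:
            decoded = decoded.replace(end_of_word_token, ' ')
        return decoded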
- lab_2_tokenize_by_bpe.main.encode(original_text: str, vocabulary: dict[str, int] | None, start_of_word_token: str | None, end_of_word_token: str | None, unknown_token: str) list[int] | None
Translate a decoded sequence into an encoded one.
- Parameters:
- Returns:
A list of token identifiers
- Return type:
In case of corrupt input arguments, or if any of the functions used returns None, None is returned
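A minimal sketch of encoding, assuming it chains the steps documented on this page: frame each word with the optional markers, split it into known tokens by greedy longest match, and map them to identifiers, with unknown_token as the fallback. The real function composes its own helpers and returns None on corrupt input, which this illustration skips.

    def encode_sketch(original_text: str, vocabulary: dict[str, int],
                      start_of_word_token: str | None, end_of_word_token: str | None,
                      unknown_token: str) -> list[int]:
        """Illustrative encoding: frame words, greedily match vocabulary tokens."""
        identifiers: list[int] = []
        for raw_word in original_text.split():
            tokens = list(raw_word)
            if start_of_word_token is not None:
                tokens.insert(0, start_of_word_token)
            if end_of_word_token is not None:
                tokens.append(end_of_word_token)
            word = ''.join(tokens)
            position = 0
            while position < len(word):  # greedy longest-match tokenization
                for end in range(len(word), position, -1):
                    if word[position:end] in vocabulary:
                        identifiers.append(vocabulary[word[position:end]])
                        position = end
                        break
                else:  # fall back to the unknown token for unseen spans
                    identifiers.append(vocabulary[unknown_token])
                    position += 1
        return identifiers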
- lab_2_tokenize_by_bpe.main.geo_mean(precisions: list[float], max_order: int) float | None
Compute the geometric mean of a sequence of values.
- Parameters:
- Returns:
A value of geometric mean of Precision metric
- Return type:
In case of corrupt input arguments, None is returned
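A minimal sketch of the geometric mean over the first max_order precision values; a single zero precision drives the result to zero, as expected for BLEU-style scoring.

    def geo_mean_sketch(precisions: list[float], max_order: int) -> float:
        """Illustrative geometric mean of precision values."""
        product = 1.0
        for value in precisions[:max_order]:
            product *= value
        return product ** (1.0 / max_order)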
- lab_2_tokenize_by_bpe.main.get_vocabulary(word_frequencies: dict[tuple[str, ...], int], unknown_token: str) dict[str, int] | None
Establish a correspondence between tokens and their integer identifiers.
- Parameters:
- Returns:
A dictionary in the form of <token: identifier>
- Return type:
In case of corrupt input arguments, None is returned
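A minimal sketch: gather every token that appears in the frequency dictionary, add unknown_token, and number the tokens consecutively. The real function's exact token set and ordering are not specified on this page, so both are assumptions here.

    def get_vocabulary_sketch(word_frequencies: dict[tuple[str, ...], int],
                              unknown_token: str) -> dict[str, int]:
        """Illustrative <token: identifier> mapping."""
        tokens = {unknown_token}
        for word in word_frequencies:
            tokens.update(word)
        return {token: identifier for identifier, token in enumerate(sorted(tokens))}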
- lab_2_tokenize_by_bpe.main.load_vocabulary(vocab_path: str) dict[str, int] | None
Read and retrieve a dictionary of the form <token: identifier>.
- Parameters:
vocab_path (str) – A path to the saved vocabulary
- Returns:
A dictionary in the form of <token: identifier>
- Return type:
In case of corrupt input arguments, None is returned
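A minimal sketch, assuming the vocabulary is persisted as a JSON object of <token: identifier> pairs; the actual on-disk format used by the lab is an assumption.

    import json

    def load_vocabulary_sketch(vocab_path: str) -> dict[str, int]:
        """Illustrative loading of a JSON vocabulary file (format assumed)."""
        with open(vocab_path, 'r', encoding='utf-8') as vocab_file:
            return json.load(vocab_file)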
- lab_2_tokenize_by_bpe.main.merge_tokens(word_frequencies: dict[tuple[str, ...], int], pair: tuple[str, str]) dict[tuple[str, ...], int] | None
Update the word frequency dictionary by replacing a pair of tokens with a merged one.
- Parameters:
- Returns:
A dictionary in the form of <preprocessed word: number of occurrences>
- Return type:
In case of corrupt input arguments, None is returned
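A minimal sketch of a single BPE merge: wherever the chosen pair occurs side by side inside a word key, the two tokens are replaced by their concatenation; counts are carried over unchanged.

    def merge_tokens_sketch(word_frequencies: dict[tuple[str, ...], int],
                            pair: tuple[str, str]) -> dict[tuple[str, ...], int]:
        """Illustrative merge of one token pair across all word keys."""
        merged: dict[tuple[str, ...], int] = {}
        for word, count in word_frequencies.items():
            new_word: list[str] = []
            index = 0
            while index < len(word):
                if index + 1 < len(word) and (word[index], word[index + 1]) == pair:
                    new_word.append(word[index] + word[index + 1])  # concatenate the pair
                    index += 2
                else:
                    new_word.append(word[index])
                    index += 1
            merged[tuple(new_word)] = count
        return merged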
- lab_2_tokenize_by_bpe.main.prepare_word(raw_word: str, start_of_word: str | None, end_of_word: str | None) tuple[str, ...] | None
Tokenize a word into unigrams and append the end-of-word token.
- Parameters:
- Returns:
Preprocessed word
- Return type:
In case of corrupt input arguments, None is returned
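A minimal sketch: split the raw word into single characters and attach the optional start-of-word and end-of-word markers.

    def prepare_word_sketch(raw_word: str, start_of_word: str | None,
                            end_of_word: str | None) -> tuple[str, ...]:
        """Illustrative preprocessing of one word into a tuple of unigrams."""
        tokens = list(raw_word)
        if start_of_word is not None:
            tokens.insert(0, start_of_word)
        if end_of_word is not None:
            tokens.append(end_of_word)
        return tuple(tokens)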
- lab_2_tokenize_by_bpe.main.tokenize_word(word: tuple[str, ...], vocabulary: dict[str, int], end_of_word: str | None, unknown_token: str) list[int] | None
Split a word into tokens.
- Parameters:
- Returns:
A list of token identifiers
- Return type:
In case of corrupt input arguments, None is returned
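A minimal sketch of splitting one prepared word into vocabulary tokens by greedy longest match, with unknown_token as the fallback; the module's own matching strategy may differ, and the end_of_word parameter is kept only to mirror the documented signature.

    def tokenize_word_sketch(word: tuple[str, ...], vocabulary: dict[str, int],
                             end_of_word: str | None, unknown_token: str) -> list[int]:
        """Illustrative greedy longest-match tokenization of a prepared word."""
        text = ''.join(word)  # end_of_word is assumed to already be part of `word`
        identifiers: list[int] = []
        position = 0
        while position < len(text):
            for end in range(len(text), position, -1):
                if text[position:end] in vocabulary:
                    identifiers.append(vocabulary[text[position:end]])
                    position = end
                    break
            else:  # no vocabulary entry matched at this position
                identifiers.append(vocabulary[unknown_token])
                position += 1
        return identifiers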
- lab_2_tokenize_by_bpe.main.train(word_frequencies: dict[tuple[str, ...], int] | None, num_merges: int) dict[tuple[str, ...], int] | None
Create the required number of new tokens by merging existing ones.
- Parameters:
- Returns:
A dictionary in the form of <preprocessed word: number of occurrences>
- Return type:
In case of corrupt input arguments, or if any of the functions used returns None, None is returned
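A minimal sketch of the training loop implied by this page: count adjacent token pairs, merge the most frequent pair, and repeat num_merges times or until no pairs remain. It reuses the pair-counting and merging sketches shown above; the tie-breaking rule is an assumption.

    def train_sketch(word_frequencies: dict[tuple[str, ...], int],
                     num_merges: int) -> dict[tuple[str, ...], int]:
        """Illustrative BPE training: repeatedly merge the most frequent pair."""
        for _ in range(num_merges):
            pair_counts = count_tokens_pairs_sketch(word_frequencies)
            if not pair_counts:
                break
            # pick the most frequent pair; ties broken lexicographically (assumed)
            best_pair = max(pair_counts, key=lambda pair: (pair_counts[pair], pair))
            word_frequencies = merge_tokens_sketch(word_frequencies, best_pair)
        return word_frequencies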