lab_3_generate_by_ngrams package

Submodules

Lab 3.

Beam-search and natural language generation evaluation

class lab_3_generate_by_ngrams.main.BackOffGenerator(language_models: tuple[NGramLanguageModel, ...], text_processor: TextProcessor)

Bases: object

Language model for back-off based text generation.

_language_models

Language models for next token prediction

Type:

tuple[NGramLanguageModel, …]

_text_processor

A TextProcessor instance to handle text processing

Type:

TextProcessor

__init__(language_models: tuple[NGramLanguageModel, ...], text_processor: TextProcessor)

Initializes an instance of BackOffGenerator.

Parameters:
  • language_models (tuple[NGramLanguageModel, ...]) – Language models to use for text generation

  • text_processor (TextProcessor) – A TextProcessor instance to handle text processing

_get_next_token(sequence_to_continue: tuple[int, ...]) dict[int, float] | None

Retrieve next tokens for sequence continuation.

Parameters:

sequence_to_continue (tuple[int, ...]) – Sequence to continue

Returns:

Next tokens for sequence continuation

Return type:

dict[int, float]

In case of corrupt input arguments, None is returned.

run(seq_len: int, prompt: str) str | None

Generate sequence based on NGram language model and prompt provided.

Parameters:
  • seq_len (int) – Number of tokens to generate

  • prompt (str) – Beginning of sequence

Returns:

Generated sequence

Return type:

str

In case of corrupt input arguments, or if any of the methods used returns None, None is returned.
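The back-off strategy itself is simple: query the model with the largest n-gram size first, and fall back to smaller sizes when it offers no candidates. A standalone sketch of that idea (the dict-of-dicts model layout and the helper name are illustrative, not the lab's API):

```python
def backoff_next_token(models, context):
    """Return next-token candidates from the model with the largest
    n-gram size that can continue the context (illustrative sketch)."""
    # models: {n_gram_size: {context_tuple: {token: probability}}}
    for size in sorted(models, reverse=True):
        tail = context[-(size - 1):] if size > 1 else ()
        candidates = models[size].get(tail)
        if candidates:
            return candidates
    return None

# Toy models: a trigram model that knows (1, 2) and a bigram fallback.
models = {
    3: {(1, 2): {3: 0.9}},
    2: {(2,): {4: 0.6}, (5,): {6: 0.5}},
}
hit = backoff_next_token(models, (1, 2))       # answered by the trigram model
fallback = backoff_next_token(models, (9, 5))  # backs off to the bigram model
```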

class lab_3_generate_by_ngrams.main.BeamSearchTextGenerator(language_model: NGramLanguageModel, text_processor: TextProcessor, beam_width: int)

Bases: object

Class for text generation with BeamSearch.

_language_model

Language model for next token prediction

Type:

NGramLanguageModel

_text_processor

A TextProcessor instance to handle text processing

Type:

TextProcessor

_beam_width

Beam width parameter for generation

Type:

int

beam_searcher

Searcher instance for the language model

Type:

BeamSearcher

__init__(language_model: NGramLanguageModel, text_processor: TextProcessor, beam_width: int)

Initializes an instance of BeamSearchTextGenerator.

Parameters:
  • language_model (NGramLanguageModel) – Language model to use for text generation

  • text_processor (TextProcessor) – A TextProcessor instance to handle text processing

  • beam_width (int) – Beam width parameter for generation

_get_next_token(sequence_to_continue: tuple[int, ...]) list[tuple[int, float]] | None

Retrieve next tokens for sequence continuation.

Parameters:

sequence_to_continue (tuple[int, ...]) – Sequence to continue

Returns:

Next tokens for sequence continuation

Return type:

list[tuple[int, float]]

In case of corrupt input arguments, None is returned.

run(prompt: str, seq_len: int) str | None

Generate sequence based on NGram language model and prompt provided.

Parameters:
  • seq_len (int) – Number of tokens to generate

  • prompt (str) – Beginning of sequence

Returns:

Generated sequence

Return type:

str

In case of corrupt input arguments, or if any of the methods used returns None, None is returned.

class lab_3_generate_by_ngrams.main.BeamSearcher(beam_width: int, language_model: NGramLanguageModel)

Bases: object

Beam Search algorithm for diverse text generation.

_beam_width

Number of candidates to consider at each step

Type:

int

_model

A language model to use for next token prediction

Type:

NGramLanguageModel

__init__(beam_width: int, language_model: NGramLanguageModel) None

Initialize an instance of BeamSearcher.

Parameters:
  • beam_width (int) – Number of candidates to consider at each step

  • language_model (NGramLanguageModel) – A language model to use for next token prediction

continue_sequence(sequence: tuple[int, ...], next_tokens: list[tuple[int, float]], sequence_candidates: dict[tuple[int, ...], float]) dict[tuple[int, ...], float] | None

Generate new sequences from the base sequence with next tokens provided.

The base sequence is deleted after continued variations are added.

Parameters:
  • sequence (tuple[int, ...]) – Base sequence to continue

  • next_tokens (list[tuple[int, float]]) – Tokens for sequence continuation

  • sequence_candidates (dict[tuple[int, ...], float]) – Storage with all sequences generated

Returns:

Updated sequence candidates

Return type:

dict[tuple[int, …], float]

In case of corrupt input arguments or unexpected behaviour of methods used, None is returned.
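The bookkeeping this method describes can be sketched without the class machinery. Assuming, for illustration only, that each candidate's float is an accumulated negative log-probability (lower is better):

```python
import math

def continue_sequence(sequence, next_tokens, candidates):
    # Sketch: extend the base sequence with every proposed token,
    # then delete the base sequence itself, as the docstring requires.
    if sequence not in candidates:
        return None
    base_score = candidates.pop(sequence)
    for token, probability in next_tokens:
        candidates[sequence + (token,)] = base_score - math.log(probability)
    return candidates

cands = continue_sequence((1,), [(2, 0.5), (3, 0.25)], {(1,): 0.0})
```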

get_next_token(sequence: tuple[int, ...]) list[tuple[int, float]] | None

Retrieves candidate tokens for sequence continuation.

Valid candidate tokens are those that appear in stored n-grams beginning with the given sequence. The number of tokens retrieved must not exceed the beam width parameter.

Parameters:

sequence (tuple[int, ...]) – Base sequence to continue

Returns:

Tokens to use for base sequence continuation.

The return value has the following format: [(token, probability), …]. The return value length matches the beam width parameter.

Return type:

list[tuple[int, float]]

In case of corrupt input arguments, or if any of the methods used returns None, None is returned.

prune_sequence_candidates(sequence_candidates: dict[tuple[int, ...], float]) dict[tuple[int, ...], float] | None

Remove those sequence candidates that do not make top-N most probable sequences.

Parameters:

sequence_candidates (dict[tuple[int, ...], float]) – Current candidate sequences

Returns:

Pruned sequences

Return type:

dict[tuple[int, …], float]

In case of corrupt input arguments, None is returned.
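Pruning reduces to keeping the beam_width best-scoring entries. A minimal sketch, again assuming (purely for illustration) that lower scores mean more probable sequences:

```python
def prune_sequence_candidates(candidates, beam_width):
    # Keep only the beam_width best-scoring sequences (lower = better
    # under the negative log-probability assumption used here).
    best = sorted(candidates.items(), key=lambda item: item[1])[:beam_width]
    return dict(best)

pruned = prune_sequence_candidates({(1,): 0.2, (2,): 0.9, (3,): 0.5}, 2)
```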

class lab_3_generate_by_ngrams.main.GreedyTextGenerator(language_model: NGramLanguageModel, text_processor: TextProcessor)

Bases: object

Greedy text generation by N-grams.

_model

A language model to use for text generation

Type:

NGramLanguageModel

_text_processor

A TextProcessor instance to handle text processing

Type:

TextProcessor

__init__(language_model: NGramLanguageModel, text_processor: TextProcessor) None

Initialize an instance of GreedyTextGenerator.

Parameters:
  • language_model (NGramLanguageModel) – A language model to use for text generation

  • text_processor (TextProcessor) – A TextProcessor instance to handle text processing

run(seq_len: int, prompt: str) str | None

Generate sequence based on NGram language model and prompt provided.

Parameters:
  • seq_len (int) – Number of tokens to generate

  • prompt (str) – Beginning of sequence

Returns:

Generated sequence

Return type:

str

In case of corrupt input arguments, or if any of the methods used returns None, None is returned.
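Greedy decoding is the simplest of the three generators: at each step take the single most probable continuation, and stop when the model offers none. A self-contained sketch with a toy bigram model (the helper names and tie-breaking rule are illustrative, not the lab's exact behaviour):

```python
def greedy_generate(next_token_fn, prompt, seq_len):
    # Repeatedly append the most probable next token; stop early
    # when the model has no candidates for the current context.
    sequence = tuple(prompt)
    for _ in range(seq_len):
        candidates = next_token_fn(sequence)
        if not candidates:
            break
        # Highest probability wins; ties go to the smaller token id.
        best = max(candidates.items(), key=lambda kv: (kv[1], -kv[0]))[0]
        sequence += (best,)
    return sequence

# Toy bigram model: last token -> {next token: probability}.
model = {(1,): {2: 0.8, 3: 0.2}, (2,): {3: 0.9}, (3,): {}}
generated = greedy_generate(lambda seq: model.get(seq[-1:]), (1,), 3)
```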

class lab_3_generate_by_ngrams.main.NGramLanguageModel(encoded_corpus: tuple | None, n_gram_size: int)

Bases: object

Store language model by n_grams, predict the next token.

_n_gram_size

A size of n-grams to use for language modelling

Type:

int

_n_gram_frequencies

Frequencies for n-grams

Type:

dict

_encoded_corpus

Encoded text

Type:

tuple

__init__(encoded_corpus: tuple | None, n_gram_size: int) None

Initialize an instance of NGramLanguageModel.

Parameters:
  • encoded_corpus (tuple) – Encoded text

  • n_gram_size (int) – A size of n-grams to use for language modelling

_extract_n_grams(encoded_corpus: tuple[int, ...]) tuple[tuple[int, ...], ...] | None

Split encoded sequence into n-grams.

Parameters:

encoded_corpus (tuple[int, ...]) – A tuple of encoded tokens

Returns:

A tuple of extracted n-grams

Return type:

tuple[tuple[int, …], …]

In case of corrupt input arguments, None is returned.
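Splitting an encoded sequence into n-grams is a sliding window. A minimal standalone sketch of the idea behind this method (validation here is simplified):

```python
def extract_n_grams(encoded_corpus, n_gram_size):
    # Slide a window of n_gram_size over the corpus; return None
    # on empty input or a non-positive size.
    if not encoded_corpus or n_gram_size < 1:
        return None
    return tuple(
        tuple(encoded_corpus[i:i + n_gram_size])
        for i in range(len(encoded_corpus) - n_gram_size + 1)
    )

bigrams = extract_n_grams((1, 2, 3, 4), 2)  # ((1, 2), (2, 3), (3, 4))
```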

build() int

Fill attribute _n_gram_frequencies from encoded corpus.

Encoded corpus is stored in the attribute _encoded_corpus

Returns:

0 if attribute is filled successfully, otherwise 1

Return type:

int

In case of corrupt input arguments, or if any of the methods used returns None, 1 is returned.

generate_next_token(sequence: tuple[int, ...]) dict | None

Retrieve tokens that can continue the given sequence along with their probabilities.

Parameters:

sequence (tuple[int, ...]) – A sequence to match beginning of NGrams for continuation

Returns:

Possible next tokens with their probabilities

Return type:

Optional[dict]

In case of corrupt input arguments, None is returned
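The probabilities come from relative frequencies: each stored n-gram that starts with the given context contributes its last token, weighted by how often that n-gram was seen. A sketch under that assumption (not the lab's exact code):

```python
def next_token_probabilities(n_gram_frequencies, sequence, n_gram_size):
    # Keep n-grams whose first n-1 tokens match the end of the
    # sequence; normalise their frequencies into probabilities.
    context = sequence[-(n_gram_size - 1):]
    matches = {
        n_gram[-1]: freq
        for n_gram, freq in n_gram_frequencies.items()
        if n_gram[:-1] == context
    }
    total = sum(matches.values())
    if not total:
        return {}
    return {token: freq / total for token, freq in matches.items()}

freqs = {(1, 2): 3, (1, 3): 1, (2, 3): 5}
probs = next_token_probabilities(freqs, (0, 1), 2)  # {2: 0.75, 3: 0.25}
```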

get_n_gram_size() int

Retrieve value stored in self._n_gram_size attribute.

Returns:

Size of stored n_grams

Return type:

int

set_n_grams(frequencies: dict) None

Setter method for n-gram frequencies.

Parameters:

frequencies (dict) – Precomputed frequencies for n-grams

class lab_3_generate_by_ngrams.main.NGramLanguageModelReader(json_path: str, eow_token: str)

Bases: object

Factory for loading language model n-grams from an external JSON file.

_json_path

Local path to assets file

Type:

str

_eow_token

Special token for text processor

Type:

str

_text_processor

A TextProcessor instance to handle text processing

Type:

TextProcessor

_content

N-grams from external JSON

Type:

dict

__init__(json_path: str, eow_token: str) None

Initialize reader instance.

Parameters:
  • json_path (str) – Local path to assets file

  • eow_token (str) – Special token for text processor

get_text_processor() TextProcessor

Get method for the processor created for the current JSON file.

Returns:

Processor created for the current JSON file

Return type:

TextProcessor

load(n_gram_size: int) NGramLanguageModel | None

Fill attribute _n_gram_frequencies from dictionary with N-grams.

The N-grams taken from dictionary must be cleaned from digits and punctuation, their length must match n_gram_size, and spaces must be replaced with EoW token.

Parameters:

n_gram_size (int) – Size of ngram

Returns:

Built language model.

Return type:

NGramLanguageModel

In case of corrupt input arguments or unexpected behaviour of methods used, None is returned.

class lab_3_generate_by_ngrams.main.TextProcessor(end_of_word_token: str)

Bases: object

Handle text tokenization, encoding and decoding.

_end_of_word_token

A token denoting word boundary

Type:

str

_storage

Dictionary in the form of <token: identifier>

Type:

dict

__init__(end_of_word_token: str) None

Initialize an instance of TextProcessor.

Parameters:

end_of_word_token (str) – A token denoting word boundary

_decode(corpus: tuple[int, ...]) tuple[str, ...] | None

Decode sentence by replacing ids with corresponding letters.

Parameters:

corpus (tuple[int, ...]) – A tuple of encoded tokens

Returns:

Sequence with decoded tokens

Return type:

tuple[str, …]

In case of corrupt input arguments, or if any of the methods used returns None, None is returned.

_postprocess_decoded_text(decoded_corpus: tuple[str, ...]) str | None

Convert decoded sentence into the string sequence.

Special symbols are replaced with spaces (no multiple spaces in a row are allowed), the first letter is capitalized, and the resulting sequence must end with a full stop.

Parameters:

decoded_corpus (tuple[str, ...]) – A tuple of decoded tokens

Returns:

Resulting text

Return type:

str

In case of corrupt input arguments, None is returned
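The postprocessing rules above can be sketched in a few lines (the `_` end-of-word token is an assumption for illustration; the real token is whatever the processor was constructed with):

```python
def postprocess_decoded_text(decoded_corpus, end_of_word_token="_"):
    # Join letters, turn EoW tokens into single spaces, capitalize
    # the first letter, and finish with a full stop.
    raw = "".join(decoded_corpus)
    words = [word for word in raw.split(end_of_word_token) if word]
    return " ".join(words).capitalize() + "."

text = postprocess_decoded_text(("h", "i", "_", "t", "h", "e", "r", "e", "_"))
```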

_put(element: str) None

Put an element into the storage, assign a unique id to it.

Parameters:

element (str) – An element to put into storage

In case of corrupt input arguments or invalid argument length, an element is not added to storage

_tokenize(text: str) tuple[str, ...] | None

Tokenize text into unigrams, separating words with special token.

Punctuation and digits are removed. The EoW token is appended after the last word in two cases: (1) it is followed by punctuation, or (2) it is followed by a space symbol.

Parameters:

text (str) – Original text

Returns:

Tokenized text

Return type:

tuple[str, …]

In case of corrupt input arguments, or if any of the methods used returns None, None is returned.
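A simplified sketch of this tokenization (it always appends the EoW token after each word, whereas the rules above are stricter about the final word; `_` stands in for the real EoW token):

```python
def tokenize(text, eow="_"):
    # Lowercase, keep letters only, and separate words with the
    # EoW token (simplified relative to the rules described above).
    tokens = []
    for word in text.lower().split():
        letters = [ch for ch in word if ch.isalpha()]
        if letters:
            tokens.extend(letters)
            tokens.append(eow)
    return tuple(tokens)

tokens = tokenize("Hi, there!")
```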

decode(encoded_corpus: tuple[int, ...]) str | None

Decode and postprocess encoded corpus by converting integer identifiers to string.

Special symbols are replaced with spaces (no multiple spaces in a row are allowed), the first letter is capitalized, and the resulting sequence must end with a full stop.

Parameters:

encoded_corpus (tuple[int, ...]) – A tuple of encoded tokens

Returns:

Resulting text

Return type:

str

In case of corrupt input arguments, or if any of the methods used returns None, None is returned.

encode(text: str) tuple[int, ...] | None

Encode text.

Tokenize text, assign each symbol an integer identifier and replace letters with their ids.

Parameters:

text (str) – An original text to be encoded

Returns:

Processed text

Return type:

tuple[int, …]

In case of corrupt input arguments, or if any of the methods used returns None, None is returned.
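Encoding ties tokenization to the storage: every previously unseen symbol gets the next free identifier, and the text becomes a tuple of ids. A self-contained sketch (the storage layout and `_` EoW token are illustrative, not the lab's exact internals):

```python
def encode(text, storage, eow="_"):
    # Assign each new symbol the next free id, then replace
    # every letter (and EoW marker) with its id.
    ids = []
    for word in text.lower().split():
        for ch in list(word) + [eow]:
            if not (ch.isalpha() or ch == eow):
                continue  # drop punctuation and digits
            if ch not in storage:
                storage[ch] = len(storage)
            ids.append(storage[ch])
    return tuple(ids)

storage = {"_": 0}
encoded = encode("ab ba", storage)  # (1, 2, 0, 2, 1, 0)
```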

fill_from_ngrams(content: dict) None

Fill internal storage with letters from external JSON.

Parameters:

content (dict) – ngrams from external JSON

get_end_of_word_token() str

Retrieve value stored in self._end_of_word_token attribute.

Returns:

EoW token

Return type:

str

get_id(element: str) int | None

Retrieve a unique identifier of an element.

Parameters:

element (str) – String element to retrieve identifier for

Returns:

Integer identifier that corresponds to the given element

Return type:

int

In case of corrupt input arguments or arguments not included in storage, None is returned

get_token(element_id: int) str | None

Retrieve an element by unique identifier.

Parameters:

element_id (int) – Identifier to retrieve an element for

Returns:

Element that corresponds to the given identifier

Return type:

str

In case of corrupt input arguments or arguments not included in storage, None is returned