lab_3_generate_by_ngrams package
Submodules
Lab 3.
Beam-search and natural language generation evaluation
- class lab_3_generate_by_ngrams.main.BackOffGenerator(language_models: tuple[NGramLanguageModel, ...], text_processor: TextProcessor)
Bases:
object
Language model for back-off based text generation.
- _language_models
Language models for next token prediction
- Type:
tuple[NGramLanguageModel, ...]
- _text_processor
A TextProcessor instance to handle text processing
- Type:
TextProcessor
- __init__(language_models: tuple[NGramLanguageModel, ...], text_processor: TextProcessor)
Initializes an instance of BackOffGenerator.
- Parameters:
language_models (tuple[NGramLanguageModel, ...]) – Language models to use for text generation
text_processor (TextProcessor) – A TextProcessor instance to handle text processing
- _get_next_token(sequence_to_continue: tuple[int, ...]) dict[int, float] | None
Retrieve next tokens for sequence continuation.
- Parameters:
sequence_to_continue (tuple[int, ...]) – Sequence to continue
- Returns:
Next tokens for sequence continuation
- Return type:
Optional[dict[int, float]]
In case of corrupt input arguments return None.
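A minimal usage sketch for BackOffGenerator, assuming the package is importable; the corpus text and N-gram sizes are placeholders, and the run(seq_len, prompt) call is an assumed entry point that is not listed in this excerpt.

```python
from lab_3_generate_by_ngrams.main import BackOffGenerator, NGramLanguageModel, TextProcessor

processor = TextProcessor('_')
# encode() returns None only for corrupt input; a literal string is fine here.
encoded = processor.encode('She is happy. He is happy.')

# Back-off falls through from larger to smaller contexts, so several
# models of decreasing N-gram size are passed in one tuple.
models = tuple(
    model
    for model in (NGramLanguageModel(encoded, size) for size in (4, 3, 2))
    if model.build() == 0
)

generator = BackOffGenerator(models, processor)
# run(seq_len, prompt) is an assumed entry point, not shown in this excerpt.
print(generator.run(51, 'Vernon'))
```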
- class lab_3_generate_by_ngrams.main.BeamSearchTextGenerator(language_model: NGramLanguageModel, text_processor: TextProcessor, beam_width: int)
Bases:
object
Class for text generation with BeamSearch.
- _language_model
Language model for next token prediction
- Type:
NGramLanguageModel
- _text_processor
A TextProcessor instance to handle text processing
- Type:
TextProcessor
- beam_searcher
Searcher instance for the language model
- Type:
BeamSearcher
- __init__(language_model: NGramLanguageModel, text_processor: TextProcessor, beam_width: int)
Initializes an instance of BeamSearchTextGenerator.
- Parameters:
language_model (NGramLanguageModel) – Language model to use for text generation
text_processor (TextProcessor) – A TextProcessor instance to handle text processing
beam_width (int) – Beam width parameter for generation
- _get_next_token(sequence_to_continue: tuple[int, ...]) list[tuple[int, float]] | None
Retrieve next tokens for sequence continuation.
- Parameters:
sequence_to_continue (tuple[int, ...]) – Sequence to continue
- Returns:
Next tokens for sequence continuation
- Return type:
Optional[list[tuple[int, float]]]
In case of corrupt input arguments return None.
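A minimal sketch of constructing a BeamSearchTextGenerator, assuming the package is importable; the corpus is a placeholder, and the run(prompt, seq_len) call is an assumed entry point that is not listed in this excerpt.

```python
from lab_3_generate_by_ngrams.main import BeamSearchTextGenerator, NGramLanguageModel, TextProcessor

processor = TextProcessor('_')
encoded = processor.encode('She is happy. He is happy.')

model = NGramLanguageModel(encoded, n_gram_size=3)
model.build()

generator = BeamSearchTextGenerator(model, processor, beam_width=3)
# run(prompt, seq_len) is an assumed entry point, not shown in this excerpt.
print(generator.run('Vernon', 56))
```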
- class lab_3_generate_by_ngrams.main.BeamSearcher(beam_width: int, language_model: NGramLanguageModel)
Bases:
object
Beam Search algorithm for diverse text generation.
- _model
A language model to use for next token prediction
- Type:
NGramLanguageModel
- __init__(beam_width: int, language_model: NGramLanguageModel) None
Initialize an instance of BeamSearcher.
- Parameters:
beam_width (int) – Number of candidates to consider at each step
language_model (NGramLanguageModel) – A language model to use for next token prediction
- continue_sequence(sequence: tuple[int, ...], next_tokens: list[tuple[int, float]], sequence_candidates: dict[tuple[int, ...], float]) dict[tuple[int, ...], float] | None
Generate new sequences from the base sequence with next tokens provided.
The base sequence is deleted after continued variations are added.
- Parameters:
sequence (tuple[int, ...]) – Base sequence to continue
next_tokens (list[tuple[int, float]]) – Tokens to use for sequence continuation
sequence_candidates (dict[tuple[int, ...], float]) – Current candidate sequences
- Returns:
Updated sequence candidates
- Return type:
Optional[dict[tuple[int, ...], float]]
In case of corrupt input arguments or unexpected behaviour of methods used return None.
- get_next_token(sequence: tuple[int, ...]) list[tuple[int, float]] | None
Retrieves candidate tokens for sequence continuation.
The valid candidate tokens are those that can continue the given sequence according to the model's N-grams. The number of tokens retrieved must not be bigger than the beam width parameter.
- Parameters:
sequence (tuple[int, ...]) – Sequence to continue
- Returns:
Tokens to use for base sequence continuation.
The return value has the following format: [(token, probability), …]. The return value length matches the beam width parameter.
- Return type:
Optional[list[tuple[int, float]]]
In case of corrupt input arguments or unexpected behaviour of methods used, return None.
- prune_sequence_candidates(sequence_candidates: dict[tuple[int, ...], float]) dict[tuple[int, ...], float] | None
Remove sequence candidates that do not make it into the top-N most probable sequences.
- Parameters:
sequence_candidates (dict[tuple[int, ...], float]) – Current candidate sequences
- Returns:
Pruned sequences
- Return type:
Optional[dict[tuple[int, ...], float]]
In case of corrupt input arguments return None.
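The three BeamSearcher methods above make up one step of the beam search loop: fetch candidate continuations, branch the base sequence, then prune to the beam width. A minimal sketch of a single step, assuming the package is importable; the corpus text and the initial 0.0 score are placeholder assumptions.

```python
from lab_3_generate_by_ngrams.main import BeamSearcher, NGramLanguageModel, TextProcessor

processor = TextProcessor('_')
encoded = processor.encode('She is happy. He is happy.')

model = NGramLanguageModel(encoded, n_gram_size=3)
model.build()

searcher = BeamSearcher(beam_width=3, language_model=model)

base_sequence = encoded[:2]        # first two letter identifiers
candidates = {base_sequence: 0.0}  # starting score is a placeholder assumption

# One beam step: fetch candidates, branch the base sequence, prune to beam width.
next_tokens = searcher.get_next_token(base_sequence)
if next_tokens is not None:
    candidates = searcher.continue_sequence(base_sequence, next_tokens, candidates)
if candidates is not None:
    candidates = searcher.prune_sequence_candidates(candidates)
print(candidates)
```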
- class lab_3_generate_by_ngrams.main.GreedyTextGenerator(language_model: NGramLanguageModel, text_processor: TextProcessor)
Bases:
object
Greedy text generation by N-grams.
- _model
A language model to use for text generation
- Type:
NGramLanguageModel
- _text_processor
A TextProcessor instance to handle text processing
- Type:
TextProcessor
- __init__(language_model: NGramLanguageModel, text_processor: TextProcessor) None
Initialize an instance of GreedyTextGenerator.
- Parameters:
language_model (NGramLanguageModel) – A language model to use for text generation
text_processor (TextProcessor) – A TextProcessor instance to handle text processing
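A minimal sketch of building the pieces a GreedyTextGenerator needs, assuming the package is importable; the corpus is a placeholder, and the run(seq_len, prompt) call is an assumed entry point that is not listed in this excerpt.

```python
from lab_3_generate_by_ngrams.main import GreedyTextGenerator, NGramLanguageModel, TextProcessor

processor = TextProcessor('_')
encoded = processor.encode('She is happy. He is happy.')

model = NGramLanguageModel(encoded, n_gram_size=3)
model.build()

generator = GreedyTextGenerator(model, processor)
# run(seq_len, prompt) is an assumed entry point, not shown in this excerpt.
print(generator.run(51, 'Vernon'))
```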
- class lab_3_generate_by_ngrams.main.NGramLanguageModel(encoded_corpus: tuple | None, n_gram_size: int)
Bases:
object
Store a language model built on N-grams and predict the next token.
- __init__(encoded_corpus: tuple | None, n_gram_size: int) None
Initialize an instance of NGramLanguageModel.
- Parameters:
encoded_corpus (tuple | None) – Encoded corpus to build the model from
n_gram_size (int) – Size of N-grams to extract
- _extract_n_grams(encoded_corpus: tuple[int, ...]) tuple[tuple[int, ...], ...] | None
Split encoded sequence into n-grams.
- Parameters:
encoded_corpus (tuple[int, ...]) – A tuple of encoded tokens
- Returns:
A tuple of extracted n-grams
- Return type:
Optional[tuple[tuple[int, ...], ...]]
In case of corrupt input arguments, None is returned
- build() int
Fill attribute _n_gram_frequencies from encoded corpus.
Encoded corpus is stored in the attribute _encoded_corpus
- Returns:
0 if attribute is filled successfully, otherwise 1
- Return type:
int
In case of corrupt input arguments, or if any of the methods used returns None, 1 is returned
- generate_next_token(sequence: tuple[int, ...]) dict | None
Retrieve tokens that can continue the given sequence along with their probabilities.
- Parameters:
sequence (tuple[int, ...]) – A sequence to match beginning of NGrams for continuation
- Returns:
Possible next tokens with their probabilities
- Return type:
Optional[dict]
In case of corrupt input arguments, None is returned
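A minimal sketch of building an NGramLanguageModel from an encoded corpus and querying it for continuations, assuming the package is importable; the corpus text is a placeholder.

```python
from lab_3_generate_by_ngrams.main import NGramLanguageModel, TextProcessor

processor = TextProcessor('_')
encoded = processor.encode('She is happy. He is happy.')

# Build a trigram model; build() returns 0 on success.
model = NGramLanguageModel(encoded, n_gram_size=3)
if encoded is not None and model.build() == 0:
    # Tokens that may follow the first two identifiers, with their probabilities.
    print(model.generate_next_token(encoded[:2]))
```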
- class lab_3_generate_by_ngrams.main.NGramLanguageModelReader(json_path: str, eow_token: str)
Bases:
object
Factory for loading language model N-grams from an external JSON file.
- _text_processor
A TextProcessor instance to handle text processing
- Type:
TextProcessor
- get_text_processor() TextProcessor
Get method for the processor created for the current JSON file.
- Returns:
Processor created for the current JSON file.
- Return type:
TextProcessor
- load(n_gram_size: int) NGramLanguageModel | None
Fill attribute _n_gram_frequencies from a dictionary with N-grams.
The N-grams taken from the dictionary must be cleaned from digits and punctuation, their length must match n_gram_size, and spaces must be replaced with the EoW token.
- Parameters:
n_gram_size (int) – Size of ngram
- Returns:
Built language model.
- Return type:
Optional[NGramLanguageModel]
In case of corrupt input arguments or unexpected behaviour of methods used, None is returned.
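A minimal sketch of loading a pre-built model through NGramLanguageModelReader, assuming the package is importable; the JSON path and EoW token are placeholders for illustration.

```python
from lab_3_generate_by_ngrams.main import NGramLanguageModelReader

# Path and EoW token below are placeholders.
reader = NGramLanguageModelReader('assets/en_ngrams.json', '_')

processor = reader.get_text_processor()
model = reader.load(n_gram_size=3)

if model is not None:
    # The loaded model and processor can be passed to any of the generators above.
    print(type(model).__name__, type(processor).__name__)
```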
- class lab_3_generate_by_ngrams.main.TextProcessor(end_of_word_token: str)
Bases:
object
Handle text tokenization, encoding and decoding.
- __init__(end_of_word_token: str) None
Initialize an instance of TextProcessor.
- Parameters:
end_of_word_token (str) – A token denoting word boundary
- _decode(corpus: tuple[int, ...]) tuple[str, ...] | None
Decode sentence by replacing ids with corresponding letters.
- Parameters:
corpus (tuple[int, ...]) – A tuple of encoded tokens
- Returns:
Sequence with decoded tokens
- Return type:
Optional[tuple[str, ...]]
In case of corrupt input arguments, None is returned. In case any of methods used return None, None is returned.
- _postprocess_decoded_text(decoded_corpus: tuple[str, ...]) str | None
Convert decoded sentence into the string sequence.
Special symbols are replaced with spaces (no multiple spaces in a row are allowed). The first letter is capitalized, and the resulting sequence must end with a full stop.
- Parameters:
decoded_corpus (tuple[str, ...]) – A tuple of decoded tokens
- Returns:
Resulting text
- Return type:
Optional[str]
In case of corrupt input arguments, None is returned
- _put(element: str) None
Put an element into the storage, assign a unique id to it.
- Parameters:
element (str) – An element to put into storage
In case of corrupt input arguments or invalid argument length, an element is not added to storage
- _tokenize(text: str) tuple[str, ...] | None
Tokenize text into unigrams, separating words with special token.
Punctuation and digits are removed. The EoW token is appended after the last word in two cases: (1) it is followed by punctuation, (2) it is followed by a space symbol.
In case of corrupt input arguments, None is returned. In case any of methods used return None, None is returned.
- decode(encoded_corpus: tuple[int, ...]) str | None
Decode and postprocess encoded corpus by converting integer identifiers to string.
Special symbols are replaced with spaces (no multiple spaces in a row are allowed). The first letter is capitalized, and the resulting sequence must end with a full stop.
- Parameters:
encoded_corpus (tuple[int, ...]) – A tuple of encoded tokens
- Returns:
Resulting text
- Return type:
Optional[str]
In case of corrupt input arguments, None is returned. In case any of methods used return None, None is returned.
- encode(text: str) tuple[int, ...] | None
Encode text.
Tokenize text, assign each symbol an integer identifier and replace letters with their ids.
- Parameters:
text (str) – An original text to be encoded
- Returns:
Processed text
- Return type:
Optional[tuple[int, ...]]
In case of corrupt input arguments, None is returned. In case any of methods used return None, None is returned.
- fill_from_ngrams(content: dict) None
Fill internal storage with letters from external JSON.
- Parameters:
content (dict) – ngrams from external JSON
- get_end_of_word_token() str
Retrieve value stored in self._end_of_word_token attribute.
- Returns:
EoW token
- Return type:
str
- get_id(element: str) int | None
Retrieve a unique identifier of an element.
- Parameters:
element (str) – String element to retrieve identifier for
- Returns:
Integer identifier that corresponds to the given element
- Return type:
Optional[int]
In case of corrupt input arguments or arguments not included in storage, None is returned
- get_token(element_id: int) str | None
Retrieve an element by unique identifier.
- Parameters:
element_id (int) – Identifier of the element to retrieve
- Returns:
Element that corresponds to the given identifier
- Return type:
Optional[str]
In case of corrupt input arguments or arguments not included in storage, None is returned
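A minimal sketch of the encode/decode round trip provided by TextProcessor, assuming the package is importable; the sample text is a placeholder.

```python
from lab_3_generate_by_ngrams.main import TextProcessor

processor = TextProcessor('_')  # '_' marks word boundaries (EoW token)

encoded = processor.encode('She is happy. He is happy.')
if encoded is not None:
    # Every letter is replaced with its integer identifier; word
    # boundaries become the identifier of the EoW token.
    print(encoded)
    print(processor.get_token(encoded[0]))  # letter behind the first identifier
    print(processor.decode(encoded))        # postprocessed text string
```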