lab_3_generate_by_ngrams package

Submodules

Lab 3.

Beam-search and natural language generation evaluation

class lab_3_generate_by_ngrams.main.BackOffGenerator(language_models: tuple[NGramLanguageModel, ...], text_processor: TextProcessor)

Bases: object

Language model for back-off based text generation.

_language_models

Language models for next token prediction

Type:

tuple[NGramLanguageModel, …]

_text_processor

A TextProcessor instance to handle text processing

Type:

TextProcessor

__init__(language_models: tuple[NGramLanguageModel, ...], text_processor: TextProcessor)

Initializes an instance of BackOffGenerator.

Parameters:
  • language_models (tuple[NGramLanguageModel, ...]) – Language models to use for text generation

  • text_processor (TextProcessor) – A TextProcessor instance to handle text processing

_get_next_token(sequence_to_continue: tuple[int, ...]) dict[int, float] | None

Retrieve next tokens for sequence continuation.

Parameters:

sequence_to_continue (tuple[int, ...]) – Sequence to continue

Returns:

Next tokens for sequence continuation

Return type:

dict[int, float]

In case of corrupt input arguments, None is returned.

run(seq_len: int, prompt: str) str | None

Generate sequence based on NGram language model and prompt provided.

Parameters:
  • seq_len (int) – Number of tokens to generate

  • prompt (str) – Beginning of sequence

Returns:

Generated sequence

Return type:

str

In case of corrupt input arguments, or if any of the methods used returns None, None is returned.
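The back-off strategy itself is simple: query the model with the largest n-gram size first, and fall back to smaller sizes when it offers no candidates. A standalone sketch of that idea (the dict-of-dicts model layout and the helper name are illustrative, not the lab's API):

```python
def backoff_next_token(models, context):
    """Return next-token candidates from the model with the largest
    n-gram size that can continue the context (illustrative sketch)."""
    # models: {n_gram_size: {context_tuple: {token: probability}}}
    for size in sorted(models, reverse=True):
        tail = context[-(size - 1):] if size > 1 else ()
        candidates = models[size].get(tail)
        if candidates:
            return candidates
    return None

# Toy models: a trigram model that knows (1, 2) and a bigram fallback.
models = {
    3: {(1, 2): {3: 0.9}},
    2: {(2,): {4: 0.6}, (5,): {6: 0.5}},
}
hit = backoff_next_token(models, (1, 2))       # answered by the trigram model
fallback = backoff_next_token(models, (9, 5))  # backs off to the bigram model
```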

class lab_3_generate_by_ngrams.main.BeamSearchTextGenerator(language_model: NGramLanguageModel, text_processor: TextProcessor, beam_width: int)

Bases: object

Class for text generation with BeamSearch.

_language_model

Language model for next token prediction

Type:

NGramLanguageModel

_text_processor

A TextProcessor instance to handle text processing

Type:

TextProcessor

_beam_width

Beam width parameter for generation

Type:

int

beam_searcher

Searcher instance for the language model

Type:

BeamSearcher

__init__(language_model: NGramLanguageModel, text_processor: TextProcessor, beam_width: int)

Initializes an instance of BeamSearchTextGenerator.

Parameters:
  • language_model (NGramLanguageModel) – Language model to use for text generation

  • text_processor (TextProcessor) – A TextProcessor instance to handle text processing

  • beam_width (int) – Beam width parameter for generation

_get_next_token(sequence_to_continue: tuple[int, ...]) list[tuple[int, float]] | None

Retrieve next tokens for sequence continuation.

Parameters:

sequence_to_continue (tuple[int, ...]) – Sequence to continue

Returns:

Next tokens for sequence continuation

Return type:

list[tuple[int, float]]

In case of corrupt input arguments, None is returned.

run(prompt: str, seq_len: int) str | None

Generate sequence based on NGram language model and prompt provided.

Parameters:
  • seq_len (int) – Number of tokens to generate

  • prompt (str) – Beginning of sequence

Returns:

Generated sequence

Return type:

str

In case of corrupt input arguments, or if any of the methods used returns None, None is returned.

class lab_3_generate_by_ngrams.main.BeamSearcher(beam_width: int, language_model: NGramLanguageModel)

Bases: object

Beam Search algorithm for diverse text generation.

_beam_width

Number of candidates to consider at each step

Type:

int

_model

A language model to use for next token prediction

Type:

NGramLanguageModel

__init__(beam_width: int, language_model: NGramLanguageModel) None

Initialize an instance of BeamSearcher.

Parameters:
  • beam_width (int) – Number of candidates to consider at each step

  • language_model (NGramLanguageModel) – A language model to use for next token prediction

continue_sequence(sequence: tuple[int, ...], next_tokens: list[tuple[int, float]], sequence_candidates: dict[tuple[int, ...], float]) dict[tuple[int, ...], float] | None

Generate new sequences from the base sequence with next tokens provided.

The base sequence is deleted after continued variations are added.

Parameters:
  • sequence (tuple[int, ...]) – Base sequence to continue

  • next_tokens (list[tuple[int, float]]) – Tokens for sequence continuation

  • sequence_candidates (dict[tuple[int, ...], float]) – Storage with all sequences generated

Returns:

Updated sequence candidates

Return type:

dict[tuple[int, …], float]

In case of corrupt input arguments or unexpected behaviour of methods used, None is returned.
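The bookkeeping this method describes can be sketched without the class machinery. Assuming, for illustration only, that each candidate's float is an accumulated negative log-probability (lower is better):

```python
import math

def continue_sequence(sequence, next_tokens, candidates):
    # Sketch: extend the base sequence with every proposed token,
    # then delete the base sequence itself, as the docstring requires.
    if sequence not in candidates:
        return None
    base_score = candidates.pop(sequence)
    for token, probability in next_tokens:
        candidates[sequence + (token,)] = base_score - math.log(probability)
    return candidates

cands = continue_sequence((1,), [(2, 0.5), (3, 0.25)], {(1,): 0.0})
```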

get_next_token(sequence: tuple[int, ...]) list[tuple[int, float]] | None

Retrieves candidate tokens for sequence continuation.

Valid candidate tokens are those that appear in stored n-grams beginning with the given sequence. The number of tokens retrieved must not exceed the beam width parameter.

Parameters:

sequence (tuple[int, ...]) – Base sequence to continue

Returns:

Tokens to use for base sequence continuation.

The return value has the following format: [(token, probability), …]. The return value length matches the beam width parameter.

Return type:

list[tuple[int, float]]

In case of corrupt input arguments, or if any of the methods used returns None, None is returned.

prune_sequence_candidates(sequence_candidates: dict[tuple[int, ...], float]) dict[tuple[int, ...], float] | None

Remove those sequence candidates that do not make top-N most probable sequences.

Parameters:

sequence_candidates (dict[tuple[int, ...], float]) – Current candidate sequences

Returns:

Pruned sequences

Return type:

dict[tuple[int, …], float]

In case of corrupt input arguments, None is returned.
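Pruning reduces to keeping the beam_width best-scoring entries. A minimal sketch, again assuming (purely for illustration) that lower scores mean more probable sequences:

```python
def prune_sequence_candidates(candidates, beam_width):
    # Keep only the beam_width best-scoring sequences (lower = better
    # under the negative log-probability assumption used here).
    best = sorted(candidates.items(), key=lambda item: item[1])[:beam_width]
    return dict(best)

pruned = prune_sequence_candidates({(1,): 0.2, (2,): 0.9, (3,): 0.5}, 2)
```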

class lab_3_generate_by_ngrams.main.GreedyTextGenerator(language_model: NGramLanguageModel, text_processor: TextProcessor)

Bases: object

Greedy text generation by N-grams.

_model

A language model to use for text generation

Type:

NGramLanguageModel

_text_processor

A TextProcessor instance to handle text processing

Type:

TextProcessor

__init__(language_model: NGramLanguageModel, text_processor: TextProcessor) None

Initialize an instance of GreedyTextGenerator.

Parameters:
  • language_model (NGramLanguageModel) – A language model to use for text generation

  • text_processor (TextProcessor) – A TextProcessor instance to handle text processing

run(seq_len: int, prompt: str) str | None

Generate sequence based on NGram language model and prompt provided.

Parameters:
  • seq_len (int) – Number of tokens to generate

  • prompt (str) – Beginning of sequence

Returns:

Generated sequence

Return type:

str

In case of corrupt input arguments, or if any of the methods used returns None, None is returned.
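Greedy decoding is the simplest of the three generators: at each step take the single most probable continuation, and stop when the model offers none. A self-contained sketch with a toy bigram model (the helper names and tie-breaking rule are illustrative, not the lab's exact behaviour):

```python
def greedy_generate(next_token_fn, prompt, seq_len):
    # Repeatedly append the most probable next token; stop early
    # when the model has no candidates for the current context.
    sequence = tuple(prompt)
    for _ in range(seq_len):
        candidates = next_token_fn(sequence)
        if not candidates:
            break
        # Highest probability wins; ties go to the smaller token id.
        best = max(candidates.items(), key=lambda kv: (kv[1], -kv[0]))[0]
        sequence += (best,)
    return sequence

# Toy bigram model: last token -> {next token: probability}.
model = {(1,): {2: 0.8, 3: 0.2}, (2,): {3: 0.9}, (3,): {}}
generated = greedy_generate(lambda seq: model.get(seq[-1:]), (1,), 3)
```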

class lab_3_generate_by_ngrams.main.NGramLanguageModel(encoded_corpus: tuple | None, n_gram_size: int)

Bases: object

Store language model by n_grams, predict the next token.

_n_gram_size

A size of n-grams to use for language modelling

Type:

int

_n_gram_frequencies

Frequencies for n-grams

Type:

dict

_encoded_corpus

Encoded text

Type:

tuple

__init__(encoded_corpus: tuple | None, n_gram_size: int) None

Initialize an instance of NGramLanguageModel.

Parameters:
  • encoded_corpus (tuple) – Encoded text

  • n_gram_size (int) – A size of n-grams to use for language modelling

_extract_n_grams(encoded_corpus: tuple[int, ...]) tuple[tuple[int, ...], ...] | None

Split encoded sequence into n-grams.

Parameters:

encoded_corpus (tuple[int, ...]) – A tuple of encoded tokens

Returns:

A tuple of extracted n-grams

Return type:

tuple[tuple[int, …], …]

In case of corrupt input arguments, None is returned.
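Splitting an encoded sequence into n-grams is a sliding window. A minimal standalone sketch of the idea behind this method (validation here is simplified):

```python
def extract_n_grams(encoded_corpus, n_gram_size):
    # Slide a window of n_gram_size over the corpus; return None
    # on empty input or a non-positive size.
    if not encoded_corpus or n_gram_size < 1:
        return None
    return tuple(
        tuple(encoded_corpus[i:i + n_gram_size])
        for i in range(len(encoded_corpus) - n_gram_size + 1)
    )

bigrams = extract_n_grams((1, 2, 3, 4), 2)  # ((1, 2), (2, 3), (3, 4))
```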

build() int

Fill attribute _n_gram_frequencies from encoded corpus.

Encoded corpus is stored in the attribute _encoded_corpus

Returns:

0 if attribute is filled successfully, otherwise 1

Return type:

int

In case of corrupt input arguments, or if any of the methods used returns None, 1 is returned.

generate_next_token(sequence: tuple[int, ...]) dict | None

Retrieve tokens that can continue the given sequence along with their probabilities.

Parameters:

sequence (tuple[int, ...]) – A sequence to match beginning of NGrams for continuation

Returns:

Possible next tokens with their probabilities

Return type:

Optional[dict]

In case of corrupt input arguments, None is returned
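The probabilities come from relative frequencies: each stored n-gram that starts with the given context contributes its last token, weighted by how often that n-gram was seen. A sketch under that assumption (not the lab's exact code):

```python
def next_token_probabilities(n_gram_frequencies, sequence, n_gram_size):
    # Keep n-grams whose first n-1 tokens match the end of the
    # sequence; normalise their frequencies into probabilities.
    context = sequence[-(n_gram_size - 1):]
    matches = {
        n_gram[-1]: freq
        for n_gram, freq in n_gram_frequencies.items()
        if n_gram[:-1] == context
    }
    total = sum(matches.values())
    if not total:
        return {}
    return {token: freq / total for token, freq in matches.items()}

freqs = {(1, 2): 3, (1, 3): 1, (2, 3): 5}
probs = next_token_probabilities(freqs, (0, 1), 2)  # {2: 0.75, 3: 0.25}
```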

get_n_gram_size() int

Retrieve value stored in self._n_gram_size attribute.

Returns:

Size of stored n_grams

Return type:

int

set_n_grams(frequencies: dict) None

Setter method for n-gram frequencies.

Parameters:

frequencies (dict) – Precomputed frequencies for n-grams

class lab_3_generate_by_ngrams.main.NGramLanguageModelReader(json_path: str, eow_token: str)

Bases: object

Factory for loading language model n-grams from an external JSON file.

_json_path

Local path to assets file

Type:

str

_eow_token

Special token for text processor

Type:

str

_text_processor

A TextProcessor instance to handle text processing

Type:

TextProcessor

_content

N-grams from external JSON

Type:

dict

__init__(json_path: str, eow_token: str) None

Initialize reader instance.

Parameters:
  • json_path (str) – Local path to assets file

  • eow_token (str) – Special token for text processor

get_text_processor() TextProcessor

Get method for the processor created for the current JSON file.

Returns:

Processor created for the current JSON file

Return type:

TextProcessor

load(n_gram_size: int) NGramLanguageModel | None

Fill attribute _n_gram_frequencies from dictionary with N-grams.

The N-grams taken from dictionary must be cleaned from digits and punctuation, their length must match n_gram_size, and spaces must be replaced with EoW token.

Parameters:

n_gram_size (int) – Size of ngram

Returns:

Built language model.

Return type:

NGramLanguageModel

In case of corrupt input arguments or unexpected behaviour of methods used, None is returned.

class lab_3_generate_by_ngrams.main.TextProcessor(end_of_word_token: str)

Bases: object

Handle text tokenization, encoding and decoding.

_end_of_word_token

A token denoting word boundary

Type:

str

_storage

Dictionary in the form of <token: identifier>

Type:

dict

__init__(end_of_word_token: str) None

Initialize an instance of TextProcessor.

Parameters:

end_of_word_token (str) – A token denoting word boundary

_decode(corpus: tuple[int, ...]) tuple[str, ...] | None

Decode sentence by replacing ids with corresponding letters.

Parameters:

corpus (tuple[int, ...]) – A tuple of encoded tokens

Returns:

Sequence with decoded tokens

Return type:

tuple[str, …]

In case of corrupt input arguments, or if any of the methods used returns None, None is returned.

_postprocess_decoded_text(decoded_corpus: tuple[str, ...]) str | None

Convert decoded sentence into the string sequence.

Special symbols are replaced with spaces (no multiple spaces in a row are allowed), the first letter is capitalized, and the resulting sequence must end with a full stop.

Parameters:

decoded_corpus (tuple[str, ...]) – A tuple of decoded tokens

Returns:

Resulting text

Return type:

str

In case of corrupt input arguments, None is returned
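The postprocessing rules above can be sketched in a few lines (the `_` end-of-word token is an assumption for illustration; the real token is whatever the processor was constructed with):

```python
def postprocess_decoded_text(decoded_corpus, end_of_word_token="_"):
    # Join letters, turn EoW tokens into single spaces, capitalize
    # the first letter, and finish with a full stop.
    raw = "".join(decoded_corpus)
    words = [word for word in raw.split(end_of_word_token) if word]
    return " ".join(words).capitalize() + "."

text = postprocess_decoded_text(("h", "i", "_", "t", "h", "e", "r", "e", "_"))
```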

_put(element: str) None

Put an element into the storage, assign a unique id to it.

Parameters:

element (str) – An element to put into storage

In case of corrupt input arguments or invalid argument length, an element is not added to storage

_tokenize(text: str) tuple[str, ...] | None

Tokenize text into unigrams, separating words with special token.

Punctuation and digits are removed. The EoW token is appended after the last word in two cases: (1) it is followed by punctuation, or (2) it is followed by a space symbol.

Parameters:

text (str) – Original text

Returns:

Tokenized text

Return type:

tuple[str, …]

In case of corrupt input arguments, or if any of the methods used returns None, None is returned.
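A simplified sketch of this tokenization (it always appends the EoW token after each word, whereas the rules above are stricter about the final word; `_` stands in for the real EoW token):

```python
def tokenize(text, eow="_"):
    # Lowercase, keep letters only, and separate words with the
    # EoW token (simplified relative to the rules described above).
    tokens = []
    for word in text.lower().split():
        letters = [ch for ch in word if ch.isalpha()]
        if letters:
            tokens.extend(letters)
            tokens.append(eow)
    return tuple(tokens)

tokens = tokenize("Hi, there!")
```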

decode(encoded_corpus: tuple[int, ...]) str | None

Decode and postprocess encoded corpus by converting integer identifiers to string.

Special symbols are replaced with spaces (no multiple spaces in a row are allowed), the first letter is capitalized, and the resulting sequence must end with a full stop.

Parameters:

encoded_corpus (tuple[int, ...]) – A tuple of encoded tokens

Returns:

Resulting text

Return type:

str

In case of corrupt input arguments, or if any of the methods used returns None, None is returned.

encode(text: str) tuple[int, ...] | None

Encode text.

Tokenize text, assign each symbol an integer identifier and replace letters with their ids.

Parameters:

text (str) – An original text to be encoded

Returns:

Processed text

Return type:

tuple[int, …]

In case of corrupt input arguments, or if any of the methods used returns None, None is returned.
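Encoding ties tokenization to the storage: every previously unseen symbol gets the next free identifier, and the text becomes a tuple of ids. A self-contained sketch (the storage layout and `_` EoW token are illustrative, not the lab's exact internals):

```python
def encode(text, storage, eow="_"):
    # Assign each new symbol the next free id, then replace
    # every letter (and EoW marker) with its id.
    ids = []
    for word in text.lower().split():
        for ch in list(word) + [eow]:
            if not (ch.isalpha() or ch == eow):
                continue  # drop punctuation and digits
            if ch not in storage:
                storage[ch] = len(storage)
            ids.append(storage[ch])
    return tuple(ids)

storage = {"_": 0}
encoded = encode("ab ba", storage)  # (1, 2, 0, 2, 1, 0)
```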

fill_from_ngrams(content: dict) None

Fill internal storage with letters from external JSON.

Parameters:

content (dict) – ngrams from external JSON

get_end_of_word_token() str

Retrieve value stored in self._end_of_word_token attribute.

Returns:

EoW token

Return type:

str

get_id(element: str) int | None

Retrieve a unique identifier of an element.

Parameters:

element (str) – String element to retrieve identifier for

Returns:

Integer identifier that corresponds to the given element

Return type:

int

In case of corrupt input arguments or arguments not included in storage, None is returned

get_token(element_id: int) str | None

Retrieve an element by unique identifier.

Parameters:

element_id (int) – Identifier to retrieve an element for

Returns:

Element that corresponds to the given identifier

Return type:

str

In case of corrupt input arguments or arguments not included in storage, None is returned