lab_4_auto_completion package

Submodules

Lab 4

class lab_4_auto_completion.main.DynamicBackOffGenerator(dynamic_trie: DynamicNgramLMTrie, processor: WordProcessor)

Bases: BackOffGenerator

Dynamic back-off generator based on dynamic N-gram trie.

__init__(dynamic_trie: DynamicNgramLMTrie, processor: WordProcessor) None

Initialize a DynamicBackOffGenerator.

Parameters:
  • dynamic_trie (DynamicNgramLMTrie) – Dynamic trie to use for text generation.

  • processor (WordProcessor) – A WordProcessor instance to handle text processing.

_dynamic_trie: DynamicNgramLMTrie

Dynamic trie for text generation

get_next_token(sequence_to_continue: tuple[int, ...]) dict[int, float] | None

Retrieve next tokens for sequence continuation.

Parameters:

sequence_to_continue (tuple[int, ...]) – Sequence to continue

Returns:

Next tokens for sequence continuation

Return type:

dict[int, float] | None

run(seq_len: int, prompt: str) str | None

Generate sequence based on dynamic N-gram trie and prompt provided.

Parameters:
  • seq_len (int) – Number of tokens to generate

  • prompt (str) – Beginning of sequence

Returns:

Generated sequence

Return type:

str | None
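The back-off idea behind this generator can be illustrated with a minimal, self-contained sketch. It does not use the real classes: TABLES and back_off_candidates are hypothetical names, and plain dictionaries stand in for the dynamic trie. The assumption is that the largest N-gram size is tried first and smaller sizes are used only when no continuation is found.

```python
from typing import Optional

# Hypothetical stand-in for the trie: n-gram size -> {context: {token: probability}}.
TABLES = {
    3: {(1, 2): {3: 1.0}},
    2: {(2,): {3: 0.5, 4: 0.5}, (5,): {1: 1.0}},
}


def back_off_candidates(
    sequence: tuple[int, ...], max_size: int = 3
) -> Optional[dict[int, float]]:
    """Try the largest n-gram size first, then fall back to smaller ones."""
    for size in range(max_size, 1, -1):
        context_len = size - 1
        if len(sequence) < context_len:
            continue  # not enough tokens for this n-gram size
        context = sequence[-context_len:]
        candidates = TABLES.get(size, {}).get(context)
        if candidates:
            return candidates
    return None  # no n-gram size produced a continuation


print(back_off_candidates((1, 2)))  # trigram context is known
print(back_off_candidates((9, 5)))  # only the bigram context (5,) is known
```

In this sketch a None result means that no context of any size matched, which a caller could treat as the signal to stop generating.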

class lab_4_auto_completion.main.DynamicNgramLMTrie(encoded_corpus: tuple[tuple[int, ...], ...], n_gram_size: int = 3)

Bases: NGramTrieLanguageModel

Trie specialized in storing N-gram tries of all possible sizes.

__init__(encoded_corpus: tuple[tuple[int, ...], ...], n_gram_size: int = 3) None

Initialize a DynamicNgramLMTrie.

Parameters:
  • encoded_corpus (tuple[NGramType, ...]) – Tokenized corpus.

  • n_gram_size (int, optional) – N-gram size. Defaults to 3.

_assign_child(parent: TrieNode, node_name: int, freq: float = 0.0) TrieNode

Return the existing child named node_name, or create a new one.

Parameters:
  • parent (TrieNode) – Parent node whose children are searched.

  • node_name (int) – Name of the child TrieNode to find or create.

  • freq (float, optional) – Frequency of child TrieNode.

Returns:

Existing or new TrieNode.

Return type:

TrieNode
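A get-or-create lookup of this kind might look like the sketch below. Node and assign_child are simplified, hypothetical stand-ins rather than the real TrieNode API. How the real method treats freq when the child already exists is not stated here, so the sketch assumes an existing child is returned unchanged.

```python
class Node:
    """Hypothetical, simplified stand-in for TrieNode."""

    def __init__(self, name=None, value=0.0):
        self.name = name
        self.value = value
        self.children = []


def assign_child(parent: Node, node_name: int, freq: float = 0.0) -> Node:
    """Return the child named node_name, creating it if it is missing."""
    for child in parent.children:
        if child.name == node_name:
            # Assumption: an existing child keeps its stored frequency.
            return child
    new_child = Node(node_name, freq)
    parent.children.append(new_child)
    return new_child


root = Node()
first = assign_child(root, 7, 0.5)
second = assign_child(root, 7)
print(first is second, len(root.children))
```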

_current_n_gram_size: int

Current size of ngrams

_encoded_corpus: tuple[tuple[int, ...], ...]

Encoded corpus to generate text

_insert_trie(source_root: TrieNode) None

Insert all nodes of source root trie into our main root.

Parameters:

source_root (TrieNode) – Root of the source trie whose nodes are inserted.

_max_ngram_size: int

Maximum ngram size

_merge() None

Merge all built N-gram trie models into a single unified trie.
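One way to picture the merge is a recursive copy of every branch from each per-size trie into a shared root. The sketch below uses nested dictionaries instead of TrieNode objects. The insert_trie helper, the "freq" and "children" keys, and the rule that a non-zero source frequency overwrites the stored one are all assumptions made for illustration.

```python
def insert_trie(target: dict, source: dict) -> None:
    """Recursively copy every branch of source into target."""
    for token, src_node in source.items():
        dst_node = target.setdefault(token, {"freq": 0.0, "children": {}})
        # Assumption: a non-zero source frequency overwrites the stored one.
        if src_node["freq"]:
            dst_node["freq"] = src_node["freq"]
        insert_trie(dst_node["children"], src_node["children"])


# Two toy tries: one holding the bigram (1, 2), one holding the trigram (1, 2, 3).
bigrams = {1: {"freq": 0.0, "children": {2: {"freq": 1.0, "children": {}}}}}
trigrams = {
    1: {
        "freq": 0.0,
        "children": {
            2: {"freq": 0.0, "children": {3: {"freq": 1.0, "children": {}}}}
        },
    }
}

merged: dict = {}
for model in (bigrams, trigrams):
    insert_trie(merged, model)

print(merged[1]["children"][2]["freq"])
print(sorted(merged[1]["children"][2]["children"]))
```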

_models: dict[int, NGramTrieLanguageModel]

Models for text generation

_root: TrieNode

Initial state of the tree

build() int

Build N-gram tries for all possible ngrams based on a corpus of tokens.

Returns:

0 if attribute is filled successfully, otherwise 1.

Return type:

int

generate_next_token(sequence: tuple[int, ...]) dict[int, float] | None

Retrieve tokens that can continue the given sequence along with their probabilities.

Parameters:

sequence (tuple[int, ...]) – A sequence to match beginning of N-grams for continuation.

Returns:

Possible next tokens with their probabilities.

Return type:

dict[int, float] | None

set_current_ngram_size(current_n_gram_size: int | None) None

Set the active N-gram size used for generation.

Parameters:

current_n_gram_size (int | None) – Current N-gram size for generation.

class lab_4_auto_completion.main.NGramTrieLanguageModel(encoded_corpus: tuple | None, n_gram_size: int)

Bases: PrefixTrie, NGramLanguageModel

Trie specialized for storing and updating n-grams with frequency information.

__init__(encoded_corpus: tuple | None, n_gram_size: int) None

Initialize an NGramTrieLanguageModel.

Parameters:
  • encoded_corpus (tuple | None) – Encoded text

  • n_gram_size (int) – A size of n-grams to use for language modelling

__str__() str

Return a string representation of the NGramTrieLanguageModel.

Returns:

String representation showing n-gram size.

Return type:

str

_collect_all_ngrams() tuple[tuple[int, ...], ...]

Collect all n-grams from the trie by traversing all paths of length n_gram_size.

Returns:

Tuple of all n-grams stored in the trie.

Return type:

tuple[NGramType, ...]

_collect_frequencies(node: TrieNode) dict[int, float]

Collect frequencies from immediate child nodes only.

Parameters:

node (TrieNode) – Current node.

Returns:

Collected frequencies of items.

Return type:

dict[int, float]

_fill_frequencies(encoded_corpus: tuple[tuple[int, ...], ...]) None

Calculate and assign frequencies for nodes in the trie based on corpus statistics.

Counts occurrences of each n-gram and stores the relative frequency on the last node of each n-gram sequence.

Parameters:

encoded_corpus (tuple[NGramType, ...]) – Tuple of n-grams extracted from the corpus.
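The frequency computation can be sketched independently of the trie. The text above does not say what the frequency is relative to. This sketch assumes it is relative to all n-grams sharing the same context (the first n - 1 tokens), and relative_frequencies is a hypothetical helper name.

```python
from collections import Counter


def relative_frequencies(
    ngrams: tuple[tuple[int, ...], ...]
) -> dict[tuple[int, ...], float]:
    """Map each n-gram to count(n-gram) / count(n-grams with the same context)."""
    ngram_counts = Counter(ngrams)
    context_counts = Counter(ngram[:-1] for ngram in ngrams)
    return {
        ngram: count / context_counts[ngram[:-1]]
        for ngram, count in ngram_counts.items()
    }


freqs = relative_frequencies(((1, 2), (1, 2), (1, 3), (2, 1)))
print(freqs)
```

With this normalization the values stored under one context sum to 1, so they can be read directly as next-token probabilities.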

_n_gram_size: int

N-gram window size used for building the trie

build() int

Build the trie using sliding n-gram windows from a tokenized corpus.

Returns:

0 if attribute is filled successfully, otherwise 1

Return type:

int
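Extracting sliding n-gram windows from a tokenized corpus can be sketched as follows. The helper name sliding_ngrams is hypothetical. The sketch assumes windows never cross sentence boundaries and that sentences shorter than n contribute nothing.

```python
def sliding_ngrams(sentence: tuple[int, ...], n: int) -> tuple[tuple[int, ...], ...]:
    """Return every contiguous window of length n from one sentence."""
    return tuple(sentence[i:i + n] for i in range(len(sentence) - n + 1))


corpus = ((1, 2, 3, 4), (5, 6))
trigrams = tuple(
    ngram for sentence in corpus for ngram in sliding_ngrams(sentence, 3)
)
print(trigrams)  # the two-token sentence yields no trigrams
```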

generate_next_token(sequence: tuple[int, ...]) dict[int, float] | None

Retrieve tokens that can continue the given sequence along with their probabilities.

Uses the last (n_gram_size - 1) tokens as context to predict the next token.

Parameters:

sequence (NGramType) – A sequence to match the beginning of N-grams for continuation

Returns:

Possible next tokens with their probabilities, or None if input is invalid or context is too short

Return type:

dict[int, float] | None
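The context rule described above (take the last n_gram_size - 1 tokens, return None when the input is unusable) can be sketched with a plain dictionary in place of the trie. TRIGRAM_TABLE and candidates are hypothetical names. Returning an empty dictionary for an unseen context is an assumption, not documented behaviour.

```python
from typing import Optional

# Hypothetical lookup table: context -> {next token: probability}.
TRIGRAM_TABLE = {(1, 2): {3: 0.75, 4: 0.25}}


def candidates(
    sequence: tuple, n_gram_size: int, table: dict
) -> Optional[dict[int, float]]:
    """Look up continuations for the last (n_gram_size - 1) tokens."""
    if not isinstance(sequence, tuple) or not sequence:
        return None  # invalid input
    context_len = n_gram_size - 1
    if len(sequence) < context_len:
        return None  # context is too short
    context = sequence[len(sequence) - context_len:]
    # Assumption: an unseen context yields an empty mapping rather than None.
    return dict(table.get(context, {}))


print(candidates((9, 1, 2), 3, TRIGRAM_TABLE))
print(candidates((2,), 3, TRIGRAM_TABLE))
```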

get_n_gram_size() int

Get the configured n-gram size.

Returns:

The current n-gram size.

Return type:

int

get_next_tokens(start_sequence: tuple[int, ...]) dict[int, float]

Get all possible next tokens and their relative frequencies for a given prefix.

Parameters:

start_sequence (NGramType) – The prefix sequence.

Returns:

Mapping of token → relative frequency.

Return type:

dict[int, float]

get_node_by_prefix(prefix: tuple[int, ...]) TrieNode

Get the node corresponding to a prefix in the trie.

Parameters:

prefix (NGramType) – Prefix to find node by.

Returns:

Found node by prefix.

Return type:

TrieNode

get_root() TrieNode

Get the root.

Returns:

Found root.

Return type:

TrieNode

update(new_corpus: tuple[tuple[int, ...]]) None

Update the trie with additional data and refresh frequency values.

Parameters:

new_corpus (tuple[NGramType]) – Additional corpus represented as token sequences.

lab_4_auto_completion.main.NGramType

Type alias for NGram.

class lab_4_auto_completion.main.PrefixTrie

Bases: object

Prefix tree for storing token sequences.

__init__() None

Initialize an empty PrefixTrie.

_insert(sequence: tuple[int, ...]) None

Insert a sequence of tokens into the PrefixTrie.

Parameters:

sequence (NGramType) – Tokens to insert.

_root: TrieNode

Initial state of the tree

clean() None

Clean the whole tree.

fill(encoded_corpus: tuple[tuple[int, ...]]) None

Fill the trie based on an encoded_corpus of tokens.

Parameters:

encoded_corpus (tuple[NGramType]) – Tokenized corpus.

get_prefix(prefix: tuple[int, ...]) TrieNode

Find the node corresponding to a prefix.

Parameters:

prefix (NGramType) – Prefix to find the node by.

Returns:

Found TrieNode by prefix

Return type:

TrieNode

suggest(prefix: tuple[int, ...]) tuple

Return all sequences in the trie that start with the given prefix.

Parameters:

prefix (NGramType) – Prefix to search for.

Returns:

Tuple of all token sequences that begin with the given prefix.

Empty tuple if prefix not found.

Return type:

tuple
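A prefix search of this kind can be sketched on a nested-dictionary trie. The suggest function below is a standalone illustration, not the class method. It assumes that a "sequence" means a complete path ending at a leaf, and the order of the results is not meant to mirror the real implementation.

```python
def suggest(trie: dict, prefix: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:
    """Return every root-to-leaf path in trie that starts with prefix."""
    node = trie
    for token in prefix:
        if token not in node:
            return ()  # prefix not found
        node = node[token]

    def collect(current: dict, path: tuple[int, ...]) -> list:
        if not current:  # leaf reached: the path is a complete sequence
            return [path]
        found = []
        for token, child in current.items():
            found.extend(collect(child, path + (token,)))
        return found

    return tuple(collect(node, prefix))


toy_trie = {1: {2: {3: {}}, 4: {}}, 5: {6: {}}}
print(suggest(toy_trie, (1,)))
print(suggest(toy_trie, (9,)))
```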

class lab_4_auto_completion.main.TrieNode(name: int | None = None, value: float = 0.0)

Bases: object

Node type for PrefixTrie.

__bool__() bool

Define the boolean value of the node.

Returns:

True if node has at least one child, False otherwise.

Return type:

bool

__init__(name: int | None = None, value: float = 0.0) None

Initialize a Trie node.

Parameters:
  • name (int | None, optional) – The name of the node.

  • value (float, optional) – The value stored in the node.

__name: int | None

Saved item in current TrieNode

__str__() str

Return a string representation of the N-gram node.

Returns:

String representation showing node data and frequency.

Return type:

str

_children: list[TrieNode]

Children nodes

_value: float

Additional payload to store in TrieNode

add_child(item: int) None

Add a new child node with the given item.

Parameters:

item (int) – Data value for the new child node.

get_children(item: int | None = None) tuple[TrieNode, ...]

Get the tuple of child nodes or one child.

Parameters:

item (int | None, optional) – Name of a specific child to look up. If omitted, all children are returned.

Returns:

Tuple of child nodes.

Return type:

tuple[TrieNode, ...]

get_name() int | None

Get the data stored in the node.

Returns:

TrieNode data.

Return type:

int | None

get_value() float

Get the value of the node.

Returns:

Frequency value.

Return type:

float

has_children() bool

Check whether the node has any children.

Returns:

True if node has at least one child, False otherwise.

Return type:

bool

set_value(new_value: float) None

Set the value of the node

Parameters:

new_value (float) – New value to store.
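The node behaviour documented above can be approximated by the sketch below. SketchNode is a hypothetical class written for illustration. It mirrors the documented method names but its internals are assumptions, for example that get_children(item) returns a tuple containing at most the matching child.

```python
class SketchNode:
    """Illustrative node: a name, a float payload and a list of children."""

    def __init__(self, name=None, value=0.0):
        self._name = name
        self._value = value
        self._children = []

    def __bool__(self) -> bool:
        # Truthiness reflects whether the node has children.
        return bool(self._children)

    def add_child(self, item: int) -> None:
        self._children.append(SketchNode(item))

    def get_children(self, item=None) -> tuple:
        if item is None:
            return tuple(self._children)
        # Assumption: filtering by name still returns a tuple.
        return tuple(c for c in self._children if c._name == item)


root = SketchNode()
root.add_child(3)
leaf = root.get_children(3)[0]
print(bool(root), bool(leaf))
```

Because truthiness depends on children, a leaf node evaluates as False. Code that needs to know whether a node exists is therefore safer with an explicit "is None" check than with "if node:".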

class lab_4_auto_completion.main.WordProcessor(end_of_sentence_token: str)

Bases: TextProcessor

Handle text tokenization, encoding and decoding at word level.

Inherits from TextProcessor but reworks logic to work with words instead of letters.

__init__(end_of_sentence_token: str) None

Initialize an instance of WordProcessor.

Parameters:

end_of_sentence_token (str) – A token denoting sentence boundary

_end_of_sentence_token: str

Special token to separate sentences

_postprocess_decoded_text(decoded_corpus: tuple[str, ...]) str

Convert a decoded sentence into a string.

Special symbols (end_of_sentence_token) separate sentences. The first letter is capitalized, and the resulting sequence must end with a full stop.

Parameters:

decoded_corpus (tuple[str, ...]) – A tuple of decoded words

Returns:

Resulting text

Return type:

str
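The post-processing rule can be sketched as a standalone function. The name postprocess and the default "<EOS>" token are hypothetical. The sketch assumes every sentence, not only the first, is capitalized and terminated with a full stop.

```python
def postprocess(decoded: tuple[str, ...], eos: str = "<EOS>") -> str:
    """Join words into sentences, capitalize each one and end it with a full stop."""
    sentences = []
    current = []
    for word in decoded:
        if word == eos:
            if current:
                sentences.append(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:  # trailing words without a closing EOS token
        sentences.append(" ".join(current))
    return " ".join(s[:1].upper() + s[1:] + "." for s in sentences)


print(postprocess(("hello", "world", "<EOS>", "how", "are", "you")))
```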

_put(element: str) None

Put an element into the storage, assign a unique id to it.

Parameters:

element (str) – An element to put into storage

In case of corrupt input arguments or invalid argument length, an element is not added to storage

_tokenize(text: str) tuple[str, ...]

Tokenize text into words, separating sentences with special token.

Punctuation and digits are removed from words. Sentences are separated by the end_of_sentence_token.

Parameters:

text (str) – Original text

Returns:

Tokenized text as words

Return type:

tuple[str, ...]
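Word-level tokenization along these lines might look like the sketch below. The tokenize function and the "<EOS>" default are hypothetical. Lowercasing and treating a trailing ".", "!" or "?" as the sentence boundary are assumptions that the description above does not confirm.

```python
def tokenize(text: str, eos: str = "<EOS>") -> tuple[str, ...]:
    """Split text into cleaned words, inserting eos after each sentence."""
    tokens = []
    for raw in text.lower().split():
        # Keep letters only: punctuation and digits are dropped.
        word = "".join(ch for ch in raw if ch.isalpha())
        if word:
            tokens.append(word)
        # Assumption: sentence-final punctuation marks a boundary.
        if raw[-1] in ".!?" and tokens and tokens[-1] != eos:
            tokens.append(eos)
    return tuple(tokens)


print(tokenize("Hello, world! It's 42 degrees."))
```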

encode_sentences(text: str) tuple

Encode text and split into sentences.

Encodes text and returns a tuple of sentence sequences, where each sentence is represented as a tuple of word IDs. Sentences are separated by the end_of_sentence_token in the encoded text.

Parameters:

text (str) – Original text to encode

Returns:

Tuple of encoded sentences, each as a tuple of word IDs

Return type:

tuple
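Splitting an encoded text into sentences can be sketched as below. split_sentences and eos_id are hypothetical names. Whether the real method keeps the end-of-sentence ID inside each sentence is not stated, so the sketch assumes it stays as the last element.

```python
def split_sentences(
    encoded: tuple[int, ...], eos_id: int
) -> tuple[tuple[int, ...], ...]:
    """Cut a flat sequence of word IDs into sentences at each eos_id."""
    sentences = []
    current = []
    for token_id in encoded:
        current.append(token_id)
        if token_id == eos_id:
            # Assumption: the boundary ID is kept at the end of the sentence.
            sentences.append(tuple(current))
            current = []
    if current:  # text that does not end with a boundary token
        sentences.append(tuple(current))
    return tuple(sentences)


print(split_sentences((1, 2, 0, 3, 4, 0), eos_id=0))
```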

lab_4_auto_completion.main.load(path: str) DynamicNgramLMTrie

Load DynamicNgramLMTrie from file.

Parameters:

path (str) – Trie path

Returns:

Trie from file.

Return type:

DynamicNgramLMTrie

lab_4_auto_completion.main.save(trie: DynamicNgramLMTrie, path: str) None

Save DynamicNgramLMTrie.

Parameters: