lab_4_auto_completion package

Submodules

Lab 4

class lab_4_auto_completion.main.DynamicBackOffGenerator(dynamic_trie: DynamicNgramLMTrie, processor: WordProcessor)

Bases: BackOffGenerator

Dynamic back-off generator based on dynamic N-gram trie.

__init__(dynamic_trie: DynamicNgramLMTrie, processor: WordProcessor) → None

Initialize a DynamicBackOffGenerator.

Parameters:

dynamic_trie (DynamicNgramLMTrie) – Dynamic trie to use for text generation.
processor (WordProcessor) – A WordProcessor instance to handle text processing.

_dynamic_trie: DynamicNgramLMTrie: Dynamic trie for text generation

get_next_token(sequence_to_continue: tuple[int, ...]) → dict[int, float] | None

Retrieve next tokens for sequence continuation.

Parameters:: sequence_to_continue (tuple[int, ...]) – Sequence to continue
Returns:: Next tokens for sequence continuation
Return type:: dict[int, float] | None
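
A minimal sketch of the back-off idea, assuming the generator tries the largest N-gram size first and falls back to smaller sizes when no candidates are found. The names back_off_next_token, make_lookup and Lookup are hypothetical and are not part of this package; plain dictionaries stand in for the trie.

```python
from typing import Callable, Optional

# Hypothetical stand-in: one lookup function per n-gram size.
Lookup = Callable[[tuple[int, ...]], Optional[dict[int, float]]]


def back_off_next_token(
    lookups: dict[int, Lookup], sequence: tuple[int, ...]
) -> Optional[dict[int, float]]:
    """Try the largest n-gram size first, then fall back to smaller ones."""
    if not sequence:
        return None
    for size in sorted(lookups, reverse=True):
        # An n-gram of this size needs (size - 1) context tokens.
        if len(sequence) < size - 1:
            continue
        candidates = lookups[size](sequence)
        if candidates:
            return candidates
    return None


def make_lookup(
    table: dict[tuple[int, ...], dict[int, float]], size: int
) -> Lookup:
    """Build a toy lookup keyed by the last (size - 1) tokens."""

    def lookup(sequence: tuple[int, ...]) -> Optional[dict[int, float]]:
        context = sequence[-(size - 1):] if size > 1 else ()
        return table.get(context)

    return lookup


lookups = {
    3: make_lookup({(1, 2): {3: 1.0}}, 3),
    2: make_lookup({(2,): {3: 0.5, 4: 0.5}, (5,): {6: 1.0}}, 2),
}

found_by_trigram = back_off_next_token(lookups, (1, 2))  # trigram context matches
found_by_bigram = back_off_next_token(lookups, (9, 5))   # falls back to bigrams
not_found = back_off_next_token(lookups, (9, 9))         # no size has candidates
```

In this sketch an empty or missing candidate set at one size is treated as a signal to back off, and None is returned only when every size fails.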

run(seq_len: int, prompt: str) → str | None

Generate sequence based on dynamic N-gram trie and prompt provided.

Parameters:

seq_len (int) – Number of tokens to generate
prompt (str) – Beginning of sequence

Returns:

Generated sequence

Return type:

str | None

class lab_4_auto_completion.main.DynamicNgramLMTrie(encoded_corpus: tuple[tuple[int, ...], ...], n_gram_size: int = 3)

Bases: NGramTrieLanguageModel

Trie specialized in storing N-gram tries of all possible sizes.

__init__(encoded_corpus: tuple[tuple[int, ...], ...], n_gram_size: int = 3) → None

Initialize a DynamicNgramLMTrie.

Parameters:

encoded_corpus (tuple[NGramType, ...]) – Tokenized corpus.
n_gram_size (int, optional) – N-gram size. Defaults to 3.

_assign_child(parent: TrieNode, node_name: int, freq: float = 0.0) → TrieNode

Return the existing child with the given node name, or create a new one.

Parameters:

parent (TrieNode) – Parent node whose children are searched.
node_name (int) – Name of the child TrieNode to find or create.
freq (float, optional) – Frequency of the child TrieNode. Defaults to 0.0.

Returns:

Existing or new TrieNode.

Return type:

TrieNode

_current_n_gram_size: int: Current size of ngrams

_encoded_corpus: tuple[tuple[int, ...], ...]: Encoded corpus to generate text

_insert_trie(source_root: TrieNode) → None

Insert all nodes of the source trie into the main root.

Parameters:: source_root (TrieNode) – Root of the source trie whose nodes are inserted.

_max_ngram_size: int: Maximum ngram size

_merge() → None: Merge all built N-gram trie models into a single unified trie.
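
The merge step can be sketched as follows, using nested dictionaries instead of TrieNode objects. This is an illustration only: new_node, assign_child, add_path and insert_trie are hypothetical helpers, and keeping the frequency of an already existing node during a merge is an assumption, not documented behaviour.

```python
def new_node(value: float = 0.0) -> dict:
    """Create a toy trie node: a frequency plus children keyed by token id."""
    return {"value": value, "children": {}}


def assign_child(parent: dict, name: int, freq: float = 0.0) -> dict:
    """Return the existing child called `name`, or create it with `freq`."""
    child = parent["children"].get(name)
    if child is None:
        child = new_node(freq)
        parent["children"][name] = child
    return child


def add_path(root: dict, path: tuple[int, ...], freq: float) -> None:
    """Insert one n-gram and store its frequency on the last node."""
    node = root
    for name in path:
        node = assign_child(node, name)
    node["value"] = freq


def insert_trie(main_root: dict, source_root: dict) -> None:
    """Copy every node of the source trie into the main trie (iterative DFS)."""
    stack = [(main_root, source_root)]
    while stack:
        target, source = stack.pop()
        for name, source_child in source["children"].items():
            # Assumption: an existing node keeps the frequency it already has.
            target_child = assign_child(target, name, source_child["value"])
            stack.append((target_child, source_child))


bigram_root = new_node()
add_path(bigram_root, (1, 2), 1.0)
trigram_root = new_node()
add_path(trigram_root, (1, 2, 3), 1.0)

main_root = new_node()
for source in (bigram_root, trigram_root):  # smaller sizes first
    insert_trie(main_root, source)

node_2 = main_root["children"][1]["children"][2]
node_3 = node_2["children"][3]
```

After both merges the path 1 → 2 carries the bigram frequency and the path 1 → 2 → 3 carries the trigram frequency, so one trie can answer lookups for either size.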

_models: dict[int, NGramTrieLanguageModel]: Models for text generation

_root: TrieNode: Initial state of the tree

build() → int

Build N-gram tries for all possible N-gram sizes based on a corpus of tokens.

Returns:: 0 if attribute is filled successfully, otherwise 1.
Return type:: int

generate_next_token(sequence: tuple[int, ...]) → dict[int, float] | None

Retrieve tokens that can continue the given sequence along with their probabilities.

Parameters:: sequence (tuple[int, ...]) – A sequence to match beginning of N-grams for continuation.
Returns:: Possible next tokens with their probabilities.
Return type:: dict[int, float] | None

set_current_ngram_size(current_n_gram_size: int | None) → None

Set the active N-gram size used for generation.

Parameters:: current_n_gram_size (int | None) – Current N-gram size for generation.

class lab_4_auto_completion.main.NGramTrieLanguageModel(encoded_corpus: tuple | None, n_gram_size: int)

Bases: PrefixTrie, NGramLanguageModel

Trie specialized for storing and updating n-grams with frequency information.

__init__(encoded_corpus: tuple | None, n_gram_size: int) → None

Initialize an NGramTrieLanguageModel.

Parameters:

encoded_corpus (tuple | None) – Encoded text
n_gram_size (int) – A size of n-grams to use for language modelling

__str__() → str

Return a string representation of the NGramTrieLanguageModel.

Returns:: String representation showing n-gram size.
Return type:: str

_collect_all_ngrams() → tuple[tuple[int, ...], ...]

Collect all n-grams from the trie by traversing all paths of length n_gram_size.

Returns:: Tuple of all n-grams stored in the trie.
Return type:: tuple[NGramType, ...]

_collect_frequencies(node: TrieNode) → dict[int, float]

Collect frequencies from immediate child nodes only.

Parameters:: node (TrieNode) – Current node.
Returns:: Collected frequencies of items.
Return type:: dict[int, float]

_fill_frequencies(encoded_corpus: tuple[tuple[int, ...], ...]) → None

Calculate and assign frequencies for nodes in the trie based on corpus statistics.

Counts occurrences of each n-gram and stores the relative frequency on the last node of each n-gram sequence.

Parameters:: encoded_corpus (tuple[NGramType, ...]) – Tuple of n-grams extracted from the corpus.
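
One possible reading of "relative frequency" is the share of an n-gram among all n-grams that start with the same context. The sketch below follows that reading, which is an assumption; the function relative_frequencies is hypothetical and returns a flat dictionary instead of writing values onto trie nodes.

```python
from collections import Counter


def relative_frequencies(
    n_grams: tuple[tuple[int, ...], ...]
) -> dict[tuple[int, ...], float]:
    """Map each n-gram to count(n-gram) / count(n-grams sharing its context)."""
    counts = Counter(n_grams)
    # The context is everything except the last token of the n-gram.
    context_totals = Counter(n_gram[:-1] for n_gram in n_grams)
    return {
        n_gram: count / context_totals[n_gram[:-1]]
        for n_gram, count in counts.items()
    }


n_grams = ((1, 2, 3), (1, 2, 3), (1, 2, 4), (2, 3, 1))
frequencies = relative_frequencies(n_grams)
# (1, 2, 3) occurs 2 times out of 3 n-grams with context (1, 2).
```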

_n_gram_size: int: N-gram window size used for building the trie

build() → int

Build the trie using sliding n-gram windows from a tokenized corpus.

Returns:: 0 if attribute is filled successfully, otherwise 1
Return type:: int
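
The sliding-window step can be sketched as below. The helper extract_n_grams is hypothetical, and skipping sentences shorter than the window is an assumption made for the example.

```python
def extract_n_grams(
    sentence: tuple[int, ...], n_gram_size: int
) -> tuple[tuple[int, ...], ...]:
    """Slide a window of n_gram_size tokens over one encoded sentence."""
    if n_gram_size < 1 or len(sentence) < n_gram_size:
        return ()
    return tuple(
        sentence[start:start + n_gram_size]
        for start in range(len(sentence) - n_gram_size + 1)
    )


corpus = ((1, 2, 3, 4), (5, 6))
all_trigrams = tuple(
    n_gram for sentence in corpus for n_gram in extract_n_grams(sentence, 3)
)
# The second sentence is too short for a trigram and contributes nothing.
```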

generate_next_token(sequence: tuple[int, ...]) → dict[int, float] | None

Retrieve tokens that can continue the given sequence along with their probabilities.

Uses the last (n_gram_size - 1) tokens as context to predict the next token.

Parameters:

sequence (NGramType) – A sequence to match beginning of NGrams for continuation

Returns:

Possible next tokens with their probabilities, or None if input is invalid or context is too short.

Return type:

dict[int, float] | None
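
A minimal sketch of the context lookup described above, assuming n-gram frequencies are available as a flat dictionary. The function next_tokens and the table layout are hypothetical; only the rule "use the last (n_gram_size - 1) tokens, return None for invalid or too short input" comes from the description.

```python
from typing import Optional


def next_tokens(
    n_gram_freqs: dict[tuple[int, ...], float],
    sequence: tuple[int, ...],
    n_gram_size: int,
) -> Optional[dict[int, float]]:
    """Return candidates that follow the last (n_gram_size - 1) tokens."""
    if not isinstance(sequence, tuple) or not sequence:
        return None
    context_size = n_gram_size - 1
    if len(sequence) < context_size:
        return None
    context = sequence[len(sequence) - context_size:]
    return {
        n_gram[-1]: freq
        for n_gram, freq in n_gram_freqs.items()
        if n_gram[:-1] == context
    }


freqs = {(1, 2, 3): 0.75, (1, 2, 4): 0.25, (2, 3, 1): 1.0}
candidates = next_tokens(freqs, (9, 1, 2), 3)  # context is (1, 2)
too_short = next_tokens(freqs, (2,), 3)        # one token, two needed
unseen = next_tokens(freqs, (7, 7), 3)         # valid input, no match
```

In this sketch an unseen context yields an empty dictionary, while None is reserved for invalid or too short input.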

get_n_gram_size() → int

Get the configured n-gram size.

Returns:: The current n-gram size.
Return type:: int

get_next_tokens(start_sequence: tuple[int, ...]) → dict[int, float]

Get all possible next tokens and their relative frequencies for a given prefix.

Parameters:: start_sequence (NGramType) – The prefix sequence.
Returns:: Mapping of token → relative frequency.
Return type:: dict[int, float]

get_node_by_prefix(prefix: tuple[int, ...]) → TrieNode

Get the node corresponding to a prefix in the trie.

Parameters:: prefix (NGramType) – Prefix to find node by.
Returns:: Found node by prefix.
Return type:: TrieNode

get_root() → TrieNode

Get the root.

Returns:: Found root.
Return type:: TrieNode

update(new_corpus: tuple[tuple[int, ...]]) → None

Update the trie with additional data and refresh frequency values.

Parameters:: new_corpus (tuple[NGramType]) – Additional corpus represented as token sequences.

lab_4_auto_completion.main.NGramType: Type alias for NGram.

class lab_4_auto_completion.main.PrefixTrie

Bases: object

Prefix tree for storing token sequences.

__init__() → None: Initialize an empty PrefixTrie.

_insert(sequence: tuple[int, ...]) → None

Insert a sequence of tokens into the PrefixTrie.

Parameters:: sequence (NGramType) – Tokens to insert.

_root: TrieNode: Initial state of the tree

clean() → None: Clean the whole tree.

fill(encoded_corpus: tuple[tuple[int, ...]]) → None

Fill the trie based on an encoded_corpus of tokens.

Parameters:: encoded_corpus (tuple[NGramType]) – Tokenized corpus.

get_prefix(prefix: tuple[int, ...]) → TrieNode

Find the node corresponding to a prefix.

Parameters:: prefix (NGramType) – Prefix to find the node by.
Returns:: Found TrieNode by prefix
Return type:: TrieNode

suggest(prefix: tuple[int, ...]) → tuple

Return all sequences in the trie that start with the given prefix.

Parameters:

prefix (NGramType) – Prefix to search for.

Returns:

Tuple of all token sequences that begin with the given prefix, or an empty tuple if the prefix is not found.

Return type:

tuple
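
The fill and suggest behaviour can be illustrated with a nested-dictionary trie. This is a sketch under assumptions: fill_trie and suggest are hypothetical free functions, only complete root-to-leaf paths are returned, and the order of results is not specified.

```python
def fill_trie(corpus: tuple[tuple[int, ...], ...]) -> dict:
    """Build a toy prefix trie where each node is a dict of children."""
    root: dict = {}
    for sequence in corpus:
        node = root
        for token in sequence:
            node = node.setdefault(token, {})
    return root


def suggest(root: dict, prefix: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:
    """Return every stored sequence that starts with the prefix."""
    node = root
    for token in prefix:
        if token not in node:
            return ()
        node = node[token]
    results = []
    stack = [(node, prefix)]
    while stack:
        current, path = stack.pop()
        if not current:  # a node without children ends a sequence
            results.append(path)
            continue
        for token, child in current.items():
            stack.append((child, path + (token,)))
    return tuple(results)


trie = fill_trie(((1, 2, 3), (1, 2, 4), (1, 5, 6), (7, 8, 9)))
matches = suggest(trie, (1, 2))
missing = suggest(trie, (3,))
```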

class lab_4_auto_completion.main.TrieNode(name: int | None = None, value: float = 0.0)

Bases: object

Node type for PrefixTrie.

__bool__() → bool

Define the boolean value of the node.

Returns:: True if node has at least one child, False otherwise.
Return type:: bool

__init__(name: int | None = None, value: float = 0.0) → None

Initialize a Trie node.

Parameters:

name (int | None, optional) – The name of the node.
value (float, optional) – The value stored in the node.

__name: int | None: Saved item in current TrieNode

__str__() → str

Return a string representation of the N-gram node.

Returns:: String representation showing node data and frequency.
Return type:: str

_children: list[TrieNode]: Children nodes

_value: float: Additional payload to store in TrieNode

add_child(item: int) → None

Add a new child node with the given item.

Parameters:: item (int) – Data value for the new child node.

get_children(item: int | None = None) → tuple[TrieNode, ...]

Get the tuple of child nodes or one child.

Parameters:: item (int | None, optional) – Item identifying a specific child to return.
Returns:: Tuple of child nodes.
Return type:: tuple[TrieNode, ...]

get_name() → int | None

Get the data stored in the node.

Returns:: TrieNode data.
Return type:: int | None

get_value() → float

Get the value of the node.

Returns:: Frequency value.
Return type:: float

has_children() → bool

Check whether the node has any children.

Returns:: True if node has at least one child, False otherwise.
Return type:: bool

set_value(new_value: float) → None

Set the value of the node.

Parameters:: new_value (float) – New value to store.
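
A compact sketch of a node with the interface documented above. SketchNode is a hypothetical stand-in, not the package's TrieNode, and returning a one-element tuple from get_children when an item is given is an assumption.

```python
from typing import Optional


class SketchNode:
    """Toy node that mirrors the documented TrieNode interface."""

    def __init__(self, name: Optional[int] = None, value: float = 0.0) -> None:
        self._name = name
        self._value = value
        self._children: list["SketchNode"] = []

    def __bool__(self) -> bool:
        # Truthiness reflects children, not the existence of the node.
        return self.has_children()

    def add_child(self, item: int) -> None:
        self._children.append(SketchNode(item))

    def get_children(self, item: Optional[int] = None) -> tuple["SketchNode", ...]:
        if item is None:
            return tuple(self._children)
        return tuple(child for child in self._children if child.get_name() == item)

    def get_name(self) -> Optional[int]:
        return self._name

    def has_children(self) -> bool:
        return bool(self._children)


root = SketchNode()
root.add_child(5)
root.add_child(7)
leaf = root.get_children(5)[0]
```

Because __bool__ is tied to children, a leaf node is falsy. Code that needs to know whether a node was found should compare against None explicitly instead of writing `if node:`.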

class lab_4_auto_completion.main.WordProcessor(end_of_sentence_token: str)

Bases: TextProcessor

Handle text tokenization, encoding and decoding at word level.

Inherits from TextProcessor but reworks logic to work with words instead of letters.

__init__(end_of_sentence_token: str) → None

Initialize an instance of WordProcessor.

Parameters:: end_of_sentence_token (str) – A token denoting sentence boundary

_end_of_sentence_token: str: Special token to separate sentences

_postprocess_decoded_text(decoded_corpus: tuple[str, ...]) → str

Convert decoded sentence into the string sequence.

Special symbols (end_of_sentence_token) separate sentences. The first letter is capitalized, and the resulting sequence must end with a full stop.

Parameters:: decoded_corpus (tuple[str, ...]) – A tuple of decoded words
Returns:: Resulting text
Return type:: str
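
A minimal sketch of this post-processing, assuming every sentence (not only the first) is capitalized and closed with a full stop. The function postprocess and the token "<EoS>" are hypothetical.

```python
def postprocess(words: tuple[str, ...], eos: str = "<EoS>") -> str:
    """Join decoded words into text, one capitalized sentence per EoS token."""
    sentences: list[list[str]] = []
    current: list[str] = []
    for word in words:
        if word == eos:
            if current:
                sentences.append(current)
            current = []
        elif word:  # ignore empty strings
            current.append(word)
    if current:
        sentences.append(current)

    rendered = []
    for sentence in sentences:
        text = " ".join(sentence)
        rendered.append(text[0].upper() + text[1:] + ".")
    return " ".join(rendered)


text = postprocess(("hello", "world", "<EoS>", "its", "late"))
```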

_put(element: str) → None

Put an element into the storage, assign a unique id to it.

Parameters:: element (str) – An element to put into storage

In case of corrupt input arguments or invalid argument length, the element is not added to storage.

_tokenize(text: str) → tuple[str, ...]

Tokenize text into words, separating sentences with special token.

Punctuation and digits are removed from words. Sentences are separated by the end_of_sentence_token.

Parameters:: text (str) – Original text
Returns:: Tokenized text as words
Return type:: tuple[str, ...]
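
Word-level tokenization along these lines can be sketched as follows. The sentence delimiters (., ! and ?), the lowercasing, and the names tokenize_words and "<EoS>" are assumptions made for the example.

```python
import re


def tokenize_words(text: str, eos: str = "<EoS>") -> tuple[str, ...]:
    """Split text into words, dropping punctuation and digits."""
    tokens: list[str] = []
    for sentence in re.split(r"[.!?]+", text):
        words = []
        for raw_word in sentence.lower().split():
            # Keep alphabetic characters only.
            word = "".join(char for char in raw_word if char.isalpha())
            if word:
                words.append(word)
        if words:
            tokens.extend(words)
            tokens.append(eos)
    return tuple(tokens)


tokens = tokenize_words("Hello, world! It's 2 late.")
```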

encode_sentences(text: str) → tuple

Encode text and split into sentences.

Encodes text and returns a tuple of sentence sequences, where each sentence is represented as a tuple of word IDs. Sentences are separated by the end_of_sentence_token in the encoded text.

Parameters:: text (str) – Original text to encode
Returns:: Tuple of encoded sentences, each as a tuple of word IDs
Return type:: tuple
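
Splitting an encoded text into sentences can be sketched as below. Whether the end-of-sentence ID stays inside each sentence is not stated above; this sketch keeps it, which is an assumption, and split_sentences is a hypothetical helper.

```python
def split_sentences(
    encoded: tuple[int, ...], eos_id: int
) -> tuple[tuple[int, ...], ...]:
    """Cut a flat sequence of word IDs after every end-of-sentence ID."""
    sentences: list[tuple[int, ...]] = []
    current: list[int] = []
    for token_id in encoded:
        current.append(token_id)
        if token_id == eos_id:
            sentences.append(tuple(current))
            current = []
    if current:  # trailing words without a closing EoS token
        sentences.append(tuple(current))
    return tuple(sentences)


sentences = split_sentences((1, 2, 0, 3, 4, 0), eos_id=0)
```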

lab_4_auto_completion.main.load(path: str) → DynamicNgramLMTrie

Load DynamicNgramLMTrie from file.

Parameters:: path (str) – Trie path
Returns:: Trie from file.
Return type:: DynamicNgramLMTrie

lab_4_auto_completion.main.save(trie: DynamicNgramLMTrie, path: str) → None

Save DynamicNgramLMTrie.

Parameters:

trie (DynamicNgramLMTrie) – Trie for saving
path (str) – Path for saving
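
The on-disk format is not described here. As an illustration only, a trie could be round-tripped through JSON as shown below; save_trie, load_trie and the name/value/children layout are all hypothetical.

```python
import json
import os
import tempfile


def save_trie(root: dict, path: str) -> None:
    """Write a nested-dictionary trie to a JSON file."""
    with open(path, "w", encoding="utf-8") as file:
        json.dump(root, file)


def load_trie(path: str) -> dict:
    """Read a nested-dictionary trie back from a JSON file."""
    with open(path, encoding="utf-8") as file:
        return json.load(file)


# Children are a list with an explicit "name" field because JSON object
# keys are always strings, and token IDs here are integers.
trie = {
    "name": None,
    "value": 0.0,
    "children": [
        {
            "name": 1,
            "value": 0.0,
            "children": [{"name": 2, "value": 1.0, "children": []}],
        }
    ],
}

with tempfile.TemporaryDirectory() as folder:
    trie_path = os.path.join(folder, "trie.json")
    save_trie(trie, trie_path)
    restored = load_trie(trie_path)
```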