lab_4_auto_completion package
Submodules
Lab 4
- class lab_4_auto_completion.main.DynamicBackOffGenerator(dynamic_trie: DynamicNgramLMTrie, processor: WordProcessor)
Bases:
BackOffGenerator
Dynamic back-off generator based on a dynamic N-gram trie.
- __init__(dynamic_trie: DynamicNgramLMTrie, processor: WordProcessor) None
Initialize a DynamicBackOffGenerator.
- Parameters:
dynamic_trie (DynamicNgramLMTrie) – Dynamic trie to use for text generation.
processor (WordProcessor) – A WordProcessor instance to handle text processing.
- _dynamic_trie: DynamicNgramLMTrie
Dynamic trie for text generation
- class lab_4_auto_completion.main.DynamicNgramLMTrie(encoded_corpus: tuple[tuple[int, ...], ...], n_gram_size: int = 3)
Bases:
NGramTrieLanguageModel
Trie specialized in storing all possible N-gram tries.
- __init__(encoded_corpus: tuple[tuple[int, ...], ...], n_gram_size: int = 3) None
Initialize a DynamicNgramLMTrie.
- _assign_child(parent: TrieNode, node_name: int, freq: float = 0.0) TrieNode
Return the existing child of parent named node_name, or create a new one.
- _insert_trie(source_root: TrieNode) None
Insert all nodes of source root trie into our main root.
- Parameters:
source_root (TrieNode) – Root of the source trie whose nodes are inserted.
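The merge described for _insert_trie and _assign_child can be sketched roughly as follows. This is a minimal illustration, not the lab's implementation: Node, assign_child and insert_trie are hypothetical stand-ins, and the rule that a non-zero frequency overwrites an existing one is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for TrieNode: token id, frequency, children."""
    name: int | None = None
    value: float = 0.0
    children: dict[int, "Node"] = field(default_factory=dict)


def assign_child(parent: Node, name: int, freq: float = 0.0) -> Node:
    """Return the child called `name`, creating it if it is missing."""
    child = parent.children.get(name)
    if child is None:
        child = Node(name, freq)
        parent.children[name] = child
    elif freq:
        # Assumption: a non-zero frequency replaces the stored one.
        child.value = freq
    return child


def insert_trie(main_root: Node, source_root: Node) -> None:
    """Copy every node below source_root into main_root, reusing shared prefixes."""
    stack = [(main_root, source_root)]
    while stack:
        target, source = stack.pop()
        for name, source_child in source.children.items():
            target_child = assign_child(target, name, source_child.value)
            stack.append((target_child, source_child))
```

Because shared prefixes are reused rather than duplicated, merging a trie that contains 1 → 3 into one that contains 1 → 2 yields a single node 1 with children 2 and 3.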
- _models: dict[int, NGramTrieLanguageModel]
Models for text generation
- build() int
Build N-gram tries for all possible ngrams based on a corpus of tokens.
- Returns:
0 if attribute is filled successfully, otherwise 1.
- Return type:
int
- class lab_4_auto_completion.main.NGramTrieLanguageModel(encoded_corpus: tuple | None, n_gram_size: int)
Bases:
PrefixTrie, NGramLanguageModel
Trie specialized for storing and updating n-grams with frequency information.
- __init__(encoded_corpus: tuple | None, n_gram_size: int) None
Initialize an NGramTrieLanguageModel.
- __str__() str
Return a string representation of the NGramTrieLanguageModel.
- Returns:
String representation showing n-gram size.
- Return type:
str
- _collect_all_ngrams() tuple[tuple[int, ...], ...]
Collect all n-grams from the trie by traversing all paths of length n_gram_size.
- Returns:
Tuple of all n-grams stored in the trie.
- Return type:
tuple[NGramType, …]
- _collect_frequencies(node: TrieNode) dict[int, float]
Collect frequencies from immediate child nodes only.
- _fill_frequencies(encoded_corpus: tuple[tuple[int, ...], ...]) None
Calculate and assign frequencies for nodes in the trie based on corpus statistics.
Counts occurrences of each n-gram and stores the relative frequency on the last node of each n-gram sequence.
- Parameters:
encoded_corpus (tuple[NGramType, ...]) – Tuple of n-grams extracted from the corpus.
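One possible reading of "relative frequency" is sketched below. It assumes each n-gram count is normalized by the number of n-grams sharing the same context (all tokens but the last); the description above does not state the denominator, so this choice is an assumption, and relative_frequencies is a hypothetical helper rather than part of the package.

```python
from collections import Counter


def relative_frequencies(ngrams):
    """Map each n-gram to count(n-gram) / count(n-grams with the same context)."""
    counts = Counter(ngrams)
    # Assumption: the context is every token of the n-gram except the last one.
    context_totals = Counter(ngram[:-1] for ngram in ngrams)
    return {
        ngram: count / context_totals[ngram[:-1]]
        for ngram, count in counts.items()
    }
```

In a trie, the resulting value for each n-gram would be stored on the node reached by its last token.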
- build() int
Build the trie using sliding n-gram windows from a tokenized corpus.
- Returns:
0 if attribute is filled successfully, otherwise 1
- Return type:
int
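The sliding-window extraction mentioned for build can be illustrated with a short sketch. sliding_ngrams is a hypothetical helper, and the handling of sentences shorter than the window (they yield no n-grams) is an assumption.

```python
def sliding_ngrams(sentence, n):
    """Return every contiguous window of n tokens from an encoded sentence."""
    # range() is empty when the sentence is shorter than n, so the result is ().
    return tuple(
        tuple(sentence[i:i + n]) for i in range(len(sentence) - n + 1)
    )
```

For the encoded sentence (1, 2, 3, 4) and n = 3, the windows are (1, 2, 3) and (2, 3, 4).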
- generate_next_token(sequence: tuple[int, ...]) dict[int, float] | None
Retrieve tokens that can continue the given sequence along with their probabilities.
Uses the last (n_gram_size - 1) tokens as context to predict the next token.
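The context selection can be sketched as below. The lookup table (context tuple mapped to candidate tokens) is a hypothetical replacement for the trie, and returning None for a sequence shorter than the context is an assumption, since the conditions for the None result are not documented here.

```python
def next_token_candidates(sequence, n_gram_size, table):
    """Look up continuations using the last (n_gram_size - 1) tokens as context."""
    context_size = n_gram_size - 1
    if len(sequence) < context_size:
        # Assumption: too little context means no prediction.
        return None
    # Slice from an explicit start index so that context_size == 0 gives ().
    context = tuple(sequence[len(sequence) - context_size:])
    return dict(table.get(context, {}))
```

With n_gram_size = 3, only the final two tokens of the sequence influence the result; earlier tokens are ignored.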
- get_n_gram_size() int
Get the configured n-gram size.
- Returns:
The current n-gram size.
- Return type:
int
- get_next_tokens(start_sequence: tuple[int, ...]) dict[int, float]
Get all possible next tokens and their relative frequencies for a given prefix.
- lab_4_auto_completion.main.NGramType
Type alias for NGram.
- class lab_4_auto_completion.main.PrefixTrie
Bases:
object
Prefix tree for storing token sequences.
- _insert(sequence: tuple[int, ...]) None
Insert a sequence of tokens into the PrefixTrie.
- Parameters:
sequence (NGramType) – Tokens to insert.
- fill(encoded_corpus: tuple[tuple[int, ...]]) None
Fill the trie based on an encoded_corpus of tokens.
- Parameters:
encoded_corpus (tuple[NGramType]) – Tokenized corpus.
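The relationship between fill and _insert can be sketched with nested dictionaries standing in for trie nodes. This is an illustration under that simplifying assumption; the functions insert and fill below are hypothetical and do not mirror the package's node type.

```python
from __future__ import annotations


def insert(root: dict, sequence: tuple[int, ...]) -> None:
    """Walk down from the root, creating a child for each missing token."""
    node = root
    for token in sequence:
        node = node.setdefault(token, {})


def fill(encoded_corpus: tuple[tuple[int, ...], ...]) -> dict:
    """Build a fresh trie by inserting every sequence of the corpus."""
    root: dict = {}
    for sequence in encoded_corpus:
        insert(root, sequence)
    return root
```

Sequences that share a prefix share the corresponding path, so (1, 2, 3) and (1, 2, 4) diverge only at the last level.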
- class lab_4_auto_completion.main.TrieNode(name: int | None = None, value: float = 0.0)
Bases:
object
Node type for PrefixTrie.
- __bool__() bool
Define the boolean value of the node.
- Returns:
True if node has at least one child, False otherwise.
- Return type:
bool
- __str__() str
Return a string representation of the N-gram node.
- Returns:
String representation showing node data and frequency.
- Return type:
str
- add_child(item: int) None
Add a new child node with the given item.
- Parameters:
item (int) – Data value for the new child node.
- get_children(item: int | None = None) tuple[TrieNode, ...]
Get the tuple of child nodes or one child.
- get_name() int | None
Get the data stored in the node.
- Returns:
TrieNode data.
- Return type:
int | None
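A node with the documented interface might look roughly like the sketch below. SketchNode is a hypothetical class; storing children in a list and returning an empty tuple when no child matches are assumptions, not documented behaviour.

```python
from __future__ import annotations


class SketchNode:
    """Hypothetical node mirroring the documented TrieNode methods."""

    def __init__(self, name: int | None = None, value: float = 0.0) -> None:
        self._name = name
        self._value = value
        self._children: list[SketchNode] = []

    def __bool__(self) -> bool:
        # True only when the node has at least one child.
        return bool(self._children)

    def add_child(self, item: int) -> None:
        self._children.append(SketchNode(item))

    def get_children(self, item: int | None = None) -> tuple[SketchNode, ...]:
        # Without an argument return all children, otherwise only the matches.
        if item is None:
            return tuple(self._children)
        return tuple(c for c in self._children if c.get_name() == item)

    def get_name(self) -> int | None:
        return self._name
```

Note that under this definition a leaf node is falsy, so `if node:` checks for children rather than for the node's existence.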
- class lab_4_auto_completion.main.WordProcessor(end_of_sentence_token: str)
Bases:
TextProcessor
Handle text tokenization, encoding and decoding at word level.
Inherits from TextProcessor but reworks logic to work with words instead of letters.
- __init__(end_of_sentence_token: str) None
Initialize an instance of WordProcessor.
- Parameters:
end_of_sentence_token (str) – A token denoting sentence boundary
- _postprocess_decoded_text(decoded_corpus: tuple[str, ...]) str
Convert a decoded sequence of tokens into a string.
Special symbols (end_of_sentence_token) separate sentences. The first letter of each sentence is capitalized, and the resulting sequence must end with a full stop.
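The post-processing rules can be sketched as follows. postprocess is a hypothetical function, the "<EoS>" default is an illustrative token value, and joining sentences with a single space is an assumption.

```python
def postprocess(decoded, end_of_sentence_token="<EoS>"):
    """Join tokens into sentences, capitalize each one and end it with a full stop."""
    sentences = []
    current = []
    for token in decoded:
        if token == end_of_sentence_token:
            # The special token closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(token)
    if current:
        sentences.append(current)
    # str.capitalize() also lowercases the rest of the sentence.
    return " ".join(" ".join(words).capitalize() + "." for words in sentences)
```

A trailing sentence without a closing token still receives its full stop, so the output always ends with one when any words are present.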
- _put(element: str) None
Put an element into the storage, assign a unique id to it.
- Parameters:
element (str) – An element to put into storage
If the input argument is corrupt or has an invalid length, the element is not added to storage.
- _tokenize(text: str) tuple[str, ...]
Tokenize text into words, separating sentences with a special token.
Punctuation and digits are removed from words. Sentences are separated by the end_of_sentence_token.
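A word-level tokenizer following these rules might look like the sketch below. tokenize_words is hypothetical; lowercasing, splitting sentences on ".", "!" and "?", and the "<EoS>" default are assumptions not stated in the description.

```python
import re


def tokenize_words(text, end_of_sentence_token="<EoS>"):
    """Split text into lowercase words and mark each sentence end with a token."""
    tokens = []
    # Assumption: sentences end at '.', '!' or '?'.
    for sentence in re.split(r"[.!?]+", text):
        words = []
        for raw in sentence.lower().split():
            # Keep letters only, which drops punctuation and digits.
            word = "".join(ch for ch in raw if ch.isalpha())
            if word:
                words.append(word)
        if words:
            tokens.extend(words)
            tokens.append(end_of_sentence_token)
    return tuple(tokens)
```

Words made up entirely of digits or punctuation disappear, and sentences left with no words produce no end-of-sentence token.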
- lab_4_auto_completion.main.load(path: str) DynamicNgramLMTrie
Load DynamicNgramLMTrie from file.
- Parameters:
path (str) – Trie path
- Returns:
Trie from file.
- Return type:
DynamicNgramLMTrie
- lab_4_auto_completion.main.save(trie: DynamicNgramLMTrie, path: str) None
Save DynamicNgramLMTrie.
- Parameters:
trie (DynamicNgramLMTrie) – Trie for saving
path (str) – Path for saving