lab_2_spellcheck package

Submodules

Lab 2.

lab_2_spellcheck.main.add_letter(word: str, alphabet: list[str]) → list[str]

Generate all possible words by inserting a letter from the alphabet at every possible position in the word.

Parameters:

word (str) – The input incorrect word.
alphabet (list[str]) – The alphabet with letters.

Returns:

A list of words with one additional letter inserted.

Return type:

list[str]

In case of corrupt input arguments, empty list is returned.

lab_2_spellcheck.main.build_vocabulary(tokens: list[str]) → dict[str, float] | None

Build a vocabulary from the documents.

Parameters:: tokens (list[str]) – List of tokens.
Returns:: Dictionary with words and relative frequencies as keys and values respectively.
Return type:: dict[str, float] | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.calculate_distance(first_token: str, vocabulary: dict[str, float], method: Literal['jaccard', 'frequency-based', 'levenshtein', 'jaro-winkler'], alphabet: list[str] | None = None) → dict[str, float] | None

Calculate distance between two strings using the specified method.

Parameters:

first_token (str) – First string to compare.
vocabulary (dict[str, float]) – Dictionary mapping words to their relative frequencies.
method (str) – Method to use for comparison.
alphabet (list[str]) – The alphabet with letters.

Returns:

Calculated distance score.

Return type:

dict[str, float] | None

In case of corrupt input arguments or unsupported method, None is returned.

lab_2_spellcheck.main.calculate_frequency_distance(word: str, frequencies: dict, alphabet: list[str]) → dict[str, float] | None

Suggest the most probable correct spelling for the word.

Parameters:

word (str) – The input incorrect word.
frequencies (dict) – A dictionary with frequencies.
alphabet (list[str]) – Alphabet for candidates creation.

Returns:

The most probable corrected word.

Return type:

dict[str, float] | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.calculate_jaccard_distance(token: str, candidate: str) → float | None

Calculate Jaccard distance between two strings.

Parameters:

token (str) – First string to compare.
candidate (str) – Second string to compare.

Returns:

Jaccard distance score in range [0, 1].

Return type:

float | None

In case of corrupt input arguments, None is returned. In case of both strings being empty, 1.0 is returned.

lab_2_spellcheck.main.calculate_jaro_distance(token: str, candidate: str, matches: int, transpositions: int) → float | None

Calculate the Jaro distance between two strings.

Parameters:

token (str) – The first string to compare.
candidate (str) – The second string to compare.
matches (int) – Number of matching letters.
transpositions (int) – Number of transpositions.

Returns:

Jaro distance score.

Return type:

float | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.calculate_jaro_winkler_distance(token: str, candidate: str, prefix_scaling: float = 0.1) → float | None

Calculate the Jaro-Winkler distance between two strings.

Parameters:

token (str) – The first string.
candidate (str) – The second string.
prefix_scaling (float) – Scaling factor for the prefix boost.

Returns:

Jaro-Winkler distance score.

Return type:

float | None

In case of corrupt input arguments or corrupt outputs of used functions, None is returned.

lab_2_spellcheck.main.calculate_levenshtein_distance(token: str, candidate: str) → int | None

Calculate the Levenshtein edit distance between two strings.

Parameters:

token (str) – First string.
candidate (str) – Second string.

Returns:

Minimum number of single-character edits (insertions, deletions,: substitutions) required to transform token into candidate.

Return type:

int | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.count_transpositions(token: str, candidate: str, token_matches: list[bool], candidate_matches: list[bool]) → int | None

Count the number of transpositions between two strings based on matching letters.

Parameters:

token (str) – The first string to compare.
candidate (str) – The second string to compare.
token_matches (list[bool]) – Boolean list indicating matches in token.
candidate_matches (list[bool]) – Boolean list indicating matches in candidate.

Returns:

Number of transpositions.

Return type:

int | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.delete_letter(word: str) → list[str]

Generate all possible words by deleting one letter from the word.

Parameters:: word (str) – The input incorrect word.
Returns:: A sorted list of words with one letter removed at each position.
Return type:: list[str]

In case of corrupt input arguments, empty list is returned.

lab_2_spellcheck.main.fill_levenshtein_matrix(token: str, candidate: str) → list[list[int]] | None

Fill a Levenshtein matrix with edit distances between all prefixes.

Parameters:

token (str) – First string.
candidate (str) – Second string.

Returns:

Completed Levenshtein distance matrix.

Return type:

list[list[int]] | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.find_correct_word(wrong_word: str, vocabulary: dict[str, float], method: Literal['jaccard', 'frequency-based', 'levenshtein', 'jaro-winkler'], alphabet: list[str] | None = None) → str | None

Find the most similar word from vocabulary using the specified method.

Parameters:

wrong_word (str) – Word that might be misspelled.
vocabulary (dict[str, float]) – Dict of candidate words.
method (str) – Method to use for comparison.
alphabet (list[str]) – The alphabet with letters.

Returns:

Word from vocabulary with the lowest distance score.: In case of ties, the closest in length and lexicographically first is chosen.

Return type:

str | None

In case of empty vocabulary, None is returned.

lab_2_spellcheck.main.find_out_of_vocab_words(tokens: list[str], vocabulary: dict[str, float]) → list[str] | None

Found words out of vocabulary.

Parameters:

tokens (list[str]) – List of tokens.
vocabulary (dict[str, float]) – Dictionary with unique words and their relative frequencies.

Returns:

List of incorrect words.

Return type:

list[str] | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.generate_candidates(word: str, alphabet: list[str]) → list[str] | None

Generate all possible candidate words for a given word using four basic operations.

Parameters:

word (str) – The input word.
alphabet (list[str]) – Alphabet for candidates creation.

Returns:

A combined list of candidate words generated by all operations.

Return type:

list[str] | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.get_matches(token: str, candidate: str, match_distance: int) → tuple[int, list[bool], list[bool]] | None

Find matching letters between two strings within a distance.

Parameters:

token (str) – The first string to compare.
candidate (str) – The second string to compare.
match_distance (int) – Maximum allowed offset for letters to be considered matching.

Returns:

Number of matching letters. Boolean list indicating matches in token. Boolean list indicating matches in candidate.

Return type:

tuple[int, list[bool], list[bool]]

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.initialize_levenshtein_matrix(token_length: int, candidate_length: int) → list[list[int]] | None

Initialize a 2D matrix for Levenshtein distance calculation.

Parameters:

token_length (int) – Length of the first string.
candidate_length (int) – Length of the second string.

Returns:

Initialized matrix with base cases filled.

Return type:

list[list[int]] | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.propose_candidates(word: str, alphabet: list[str]) → tuple[str, ...] | None

Generate candidate words by applying single-edit operations (delete, add, replace, swap) to the word.

Parameters:

word (str) – The input incorrect word.
alphabet (list[str]) – Alphabet for candidates creation.

Returns:

A tuple of unique candidate words generated from the input.

Return type:

tuple[str] | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.replace_letter(word: str, alphabet: list[str]) → list[str]

Generate all possible words by replacing each letter in the word with letters from the alphabet.

Parameters:

word (str) – The input incorrect word.
alphabet (list[str]) – The alphabet with letters.

Returns:

A sorted list of words with one letter replaced at each position.

Return type:

list[str]

In case of corrupt input arguments, empty list is returned.

lab_2_spellcheck.main.swap_adjacent(word: str) → list[str]

Generate all possible words by swapping each pair of adjacent letters in the word.

Parameters:: word (str) – The input incorrect word.
Returns:: A sorted list of words where two neighboring letters are swapped.
Return type:: list[str]

In case of corrupt input arguments, empty list is returned.

lab_2_spellcheck.main.winkler_adjustment(token: str, candidate: str, jaro_distance: float, prefix_scaling: float = 0.1) → float | None

Apply the Winkler adjustment to boost distance for strings with a common prefix.

Parameters:

token (str) – The first string to compare.
candidate (str) – The second string to compare.
jaro_distance (float) – Jaro distance score.
prefix_scaling (float) – Scaling factor for the prefix boost.

Returns:

Winkler adjustment score.

Return type:

float | None

In case of corrupt input arguments, None is returned.