lab_2_spellcheck package

Submodules

Lab 2.

lab_2_spellcheck.main.add_letter(word: str, alphabet: list[str]) list[str]

Generate all possible words by inserting a letter from the alphabet at every possible position in the word.

Parameters:
  • word (str) – The input incorrect word.

  • alphabet (list[str]) – The alphabet with letters.

Returns:

A list of words with one additional letter inserted.

Return type:

list[str]

In case of corrupt input arguments, empty list is returned.

lab_2_spellcheck.main.build_vocabulary(tokens: list[str]) dict[str, float] | None

Build a vocabulary from the documents.

Parameters:

tokens (list[str]) – List of tokens.

Returns:

Dictionary with words and relative frequencies as keys and values respectively.

Return type:

dict[str, float] | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.calculate_distance(first_token: str, vocabulary: dict[str, float], method: Literal['jaccard', 'frequency-based', 'levenshtein', 'jaro-winkler'], alphabet: list[str] | None = None) dict[str, float] | None

Calculate distance between two strings using the specified method.

Parameters:
  • first_token (str) – First string to compare.

  • vocabulary (dict[str, float]) – Dictionary mapping words to their relative frequencies.

  • method (str) – Method to use for comparison.

  • alphabet (list[str]) – The alphabet with letters.

Returns:

Calculated distance score.

Return type:

dict[str, float] | None

In case of corrupt input arguments or unsupported method, None is returned.

lab_2_spellcheck.main.calculate_frequency_distance(word: str, frequencies: dict, alphabet: list[str]) dict[str, float] | None

Suggest the most probable correct spelling for the word.

Parameters:
  • word (str) – The input incorrect word.

  • frequencies (dict) – A dictionary with frequencies.

  • alphabet (list[str]) – Alphabet for candidates creation.

Returns:

The most probable corrected word.

Return type:

dict[str, float] | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.calculate_jaccard_distance(token: str, candidate: str) float | None

Calculate Jaccard distance between two strings.

Parameters:
  • token (str) – First string to compare.

  • candidate (str) – Second string to compare.

Returns:

Jaccard distance score in range [0, 1].

Return type:

float | None

In case of corrupt input arguments, None is returned. In case of both strings being empty, 1.0 is returned.

lab_2_spellcheck.main.calculate_jaro_distance(token: str, candidate: str, matches: int, transpositions: int) float | None

Calculate the Jaro distance between two strings.

Parameters:
  • token (str) – The first string to compare.

  • candidate (str) – The second string to compare.

  • matches (int) – Number of matching letters.

  • transpositions (int) – Number of transpositions.

Returns:

Jaro distance score.

Return type:

float | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.calculate_jaro_winkler_distance(token: str, candidate: str, prefix_scaling: float = 0.1) float | None

Calculate the Jaro-Winkler distance between two strings.

Parameters:
  • token (str) – The first string.

  • candidate (str) – The second string.

  • prefix_scaling (float) – Scaling factor for the prefix boost.

Returns:

Jaro-Winkler distance score.

Return type:

float | None

In case of corrupt input arguments or corrupt outputs of used functions, None is returned.

lab_2_spellcheck.main.calculate_levenshtein_distance(token: str, candidate: str) int | None

Calculate the Levenshtein edit distance between two strings.

Parameters:
  • token (str) – First string.

  • candidate (str) – Second string.

Returns:

Minimum number of single-character edits (insertions, deletions,

substitutions) required to transform token into candidate.

Return type:

int | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.count_transpositions(token: str, candidate: str, token_matches: list[bool], candidate_matches: list[bool]) int | None

Count the number of transpositions between two strings based on matching letters.

Parameters:
  • token (str) – The first string to compare.

  • candidate (str) – The second string to compare.

  • token_matches (list[bool]) – Boolean list indicating matches in token.

  • candidate_matches (list[bool]) – Boolean list indicating matches in candidate.

Returns:

Number of transpositions.

Return type:

int | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.delete_letter(word: str) list[str]

Generate all possible words by deleting one letter from the word.

Parameters:

word (str) – The input incorrect word.

Returns:

A sorted list of words with one letter removed at each position.

Return type:

list[str]

In case of corrupt input arguments, empty list is returned.

lab_2_spellcheck.main.fill_levenshtein_matrix(token: str, candidate: str) list[list[int]] | None

Fill a Levenshtein matrix with edit distances between all prefixes.

Parameters:
  • token (str) – First string.

  • candidate (str) – Second string.

Returns:

Completed Levenshtein distance matrix.

Return type:

list[list[int]] | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.find_correct_word(wrong_word: str, vocabulary: dict[str, float], method: Literal['jaccard', 'frequency-based', 'levenshtein', 'jaro-winkler'], alphabet: list[str] | None = None) str | None

Find the most similar word from vocabulary using the specified method.

Parameters:
  • wrong_word (str) – Word that might be misspelled.

  • vocabulary (dict[str, float]) – Dict of candidate words.

  • method (str) – Method to use for comparison.

  • alphabet (list[str]) – The alphabet with letters.

Returns:

Word from vocabulary with the lowest distance score.

In case of ties, the closest in length and lexicographically first is chosen.

Return type:

str | None

In case of empty vocabulary, None is returned.

lab_2_spellcheck.main.find_out_of_vocab_words(tokens: list[str], vocabulary: dict[str, float]) list[str] | None

Found words out of vocabulary.

Parameters:
  • tokens (list[str]) – List of tokens.

  • vocabulary (dict[str, float]) – Dictionary with unique words and their relative frequencies.

Returns:

List of incorrect words.

Return type:

list[str] | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.generate_candidates(word: str, alphabet: list[str]) list[str] | None

Generate all possible candidate words for a given word using four basic operations.

Parameters:
  • word (str) – The input word.

  • alphabet (list[str]) – Alphabet for candidates creation.

Returns:

A combined list of candidate words generated by all operations.

Return type:

list[str] | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.get_matches(token: str, candidate: str, match_distance: int) tuple[int, list[bool], list[bool]] | None

Find matching letters between two strings within a distance.

Parameters:
  • token (str) – The first string to compare.

  • candidate (str) – The second string to compare.

  • match_distance (int) – Maximum allowed offset for letters to be considered matching.

Returns:

Number of matching letters. Boolean list indicating matches in token. Boolean list indicating matches in candidate.

Return type:

tuple[int, list[bool], list[bool]]

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.initialize_levenshtein_matrix(token_length: int, candidate_length: int) list[list[int]] | None

Initialize a 2D matrix for Levenshtein distance calculation.

Parameters:
  • token_length (int) – Length of the first string.

  • candidate_length (int) – Length of the second string.

Returns:

Initialized matrix with base cases filled.

Return type:

list[list[int]] | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.propose_candidates(word: str, alphabet: list[str]) tuple[str, ...] | None

Generate candidate words by applying single-edit operations (delete, add, replace, swap) to the word.

Parameters:
  • word (str) – The input incorrect word.

  • alphabet (list[str]) – Alphabet for candidates creation.

Returns:

A tuple of unique candidate words generated from the input.

Return type:

tuple[str] | None

In case of corrupt input arguments, None is returned.

lab_2_spellcheck.main.replace_letter(word: str, alphabet: list[str]) list[str]

Generate all possible words by replacing each letter in the word with letters from the alphabet.

Parameters:
  • word (str) – The input incorrect word.

  • alphabet (list[str]) – The alphabet with letters.

Returns:

A sorted list of words with one letter replaced at each position.

Return type:

list[str]

In case of corrupt input arguments, empty list is returned.

lab_2_spellcheck.main.swap_adjacent(word: str) list[str]

Generate all possible words by swapping each pair of adjacent letters in the word.

Parameters:

word (str) – The input incorrect word.

Returns:

A sorted list of words where two neighboring letters are swapped.

Return type:

list[str]

In case of corrupt input arguments, empty list is returned.

lab_2_spellcheck.main.winkler_adjustment(token: str, candidate: str, jaro_distance: float, prefix_scaling: float = 0.1) float | None

Apply the Winkler adjustment to boost distance for strings with a common prefix.

Parameters:
  • token (str) – The first string to compare.

  • candidate (str) – The second string to compare.

  • jaro_distance (float) – Jaro distance score.

  • prefix_scaling (float) – Scaling factor for the prefix boost.

Returns:

Winkler adjustment score.

Return type:

float | None

In case of corrupt input arguments, None is returned.