lab_1_classify_by_unigrams package

Submodules

Lab 1.

Language detection

lab_1_classify_by_unigrams.main.calculate_frequencies(tokens: list[str] | None) → dict[str, float] | None

Calculate frequencies of given tokens.

Parameters:

tokens (list[str] | None) – A list of tokens

Returns:

A dictionary with frequencies

Return type:

dict[str, float] | None

In case of corrupt input arguments, None is returned
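A minimal sketch of how such a function might behave, assuming "frequencies" means relative frequencies (count divided by the total number of tokens). The name calculate_frequencies_sketch is hypothetical, and returning an empty dictionary for an empty list is an assumption, not documented behaviour.

```python
from collections import Counter


def calculate_frequencies_sketch(tokens):
    """Return the relative frequency of each token, or None for invalid input."""
    # Reject anything that is not a list of strings.
    if not isinstance(tokens, list) or not all(isinstance(t, str) for t in tokens):
        return None
    # Assumption: an empty list yields an empty dictionary.
    if not tokens:
        return {}
    total = len(tokens)
    return {token: count / total for token, count in Counter(tokens).items()}


freqs = calculate_frequencies_sketch(["a", "b", "a", "c"])
print(freqs)
```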

lab_1_classify_by_unigrams.main.calculate_mse(predicted: list, actual: list) → float | None

Calculate mean squared error between predicted and actual values.

Parameters:
  • predicted (list) – A list of predicted values

  • actual (list) – A list of actual values

Returns:

The score

Return type:

float | None

In case of corrupt input arguments, None is returned
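Mean squared error is the average of the squared differences between paired values. The sketch below illustrates that definition; the name calculate_mse_sketch is hypothetical, and treating empty or unequal-length lists as corrupt input is an assumption.

```python
def calculate_mse_sketch(predicted, actual):
    """Return the mean squared error of two equal-length lists, or None."""
    if not isinstance(predicted, list) or not isinstance(actual, list):
        return None
    # Assumption: empty or mismatched lists count as corrupt input.
    if not predicted or len(predicted) != len(actual):
        return None
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sum(squared_errors) / len(squared_errors)


mse = calculate_mse_sketch([1.0, 2.0, 3.0], [1.0, 4.0, 3.0])
print(mse)
```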

lab_1_classify_by_unigrams.main.collect_profiles(paths_to_profiles: list) → list[dict[str, str | dict[str, float]]] | None

Collect profiles for the given paths.

Parameters:

paths_to_profiles (list) – A list of paths to the profiles

Returns:

A list of loaded profiles

Return type:

list[dict[str, str | dict[str, float]]] | None

In case of corrupt input arguments, None is returned
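A rough sketch of collecting several profiles, assuming each profile is stored as a JSON file. The storage format is not stated here, and the name collect_profiles_sketch is hypothetical. Any preprocessing of the loaded profiles is left out.

```python
import json
import tempfile
from pathlib import Path


def collect_profiles_sketch(paths_to_profiles):
    """Load every profile in the list, or return None for invalid input."""
    if not isinstance(paths_to_profiles, list):
        return None
    profiles = []
    for path in paths_to_profiles:
        if not isinstance(path, str):
            return None
        # Assumption: profiles are UTF-8 encoded JSON files.
        with open(path, encoding="utf-8") as file:
            profiles.append(json.load(file))
    return profiles


# Write a tiny illustrative profile to a temporary file and collect it.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "en.json"
    path.write_text(json.dumps({"name": "en", "freq": {"a": 0.6, "b": 0.4}}),
                    encoding="utf-8")
    collected = collect_profiles_sketch([str(path)])
print(collected)
```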

lab_1_classify_by_unigrams.main.compare_profiles(unknown_profile: dict[str, str | dict[str, float]], profile_to_compare: dict[str, str | dict[str, float]]) → float | None

Compare profiles and calculate the distance using symbols.

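Parameters:
  • unknown_profile (dict[str, str | dict[str, float]]) – A dictionary of an unknown profile

  • profile_to_compare (dict[str, str | dict[str, float]]) – A dictionary of a profile to compare the unknown profile to

Returns: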

The distance between the profiles

Return type:

float | None

In case of corrupt input arguments or lack of keys ‘name’ and ‘freq’ in arguments, None is returned
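One plausible way to compute such a distance is the mean squared error over the union of symbols in both profiles. That choice, the treatment of a missing symbol as frequency 0.0, and the name compare_profiles_sketch are all assumptions made for illustration.

```python
def compare_profiles_sketch(unknown_profile, profile_to_compare):
    """Return an MSE-based distance between two profiles, or None."""
    for profile in (unknown_profile, profile_to_compare):
        if not isinstance(profile, dict) or "name" not in profile or "freq" not in profile:
            return None
    unknown_freq = unknown_profile["freq"]
    known_freq = profile_to_compare["freq"]
    # Assumption: a symbol absent from one profile has frequency 0.0 there.
    symbols = sorted(set(unknown_freq) | set(known_freq))
    if not symbols:
        return 0.0
    squared = [(unknown_freq.get(s, 0.0) - known_freq.get(s, 0.0)) ** 2 for s in symbols]
    return sum(squared) / len(squared)


distance = compare_profiles_sketch(
    {"name": "unknown", "freq": {"a": 0.5, "b": 0.5}},
    {"name": "en", "freq": {"a": 0.5, "c": 0.5}},
)
print(distance)
```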

lab_1_classify_by_unigrams.main.create_language_profile(language: str, text: str) → dict[str, str | dict[str, float]] | None

Create a language profile.

Parameters:
  • language (str) – A language

  • text (str) – A text

Returns:

A dictionary with two keys – name, freq

Return type:

dict[str, str | dict[str, float]] | None

In case of corrupt input arguments, None is returned
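A sketch of building a profile with the two documented keys, assuming tokens are single lower-cased letters and frequencies are relative. The name create_language_profile_sketch is hypothetical.

```python
from collections import Counter


def create_language_profile_sketch(language, text):
    """Return {'name': ..., 'freq': ...} for a text, or None for invalid input."""
    if not isinstance(language, str) or not isinstance(text, str):
        return None
    # Assumption: a token is a single alphabetic character in lowercase.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    freq = {ch: n / total for ch, n in Counter(letters).items()} if total else {}
    return {"name": language, "freq": freq}


profile = create_language_profile_sketch("en", "Aab!")
print(profile)
```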

lab_1_classify_by_unigrams.main.detect_language(unknown_profile: dict[str, str | dict[str, float]], profile_1: dict[str, str | dict[str, float]], profile_2: dict[str, str | dict[str, float]]) → str | None

Detect the language of an unknown profile.

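Parameters:
  • unknown_profile (dict[str, str | dict[str, float]]) – A dictionary of a profile to determine the language of

  • profile_1 (dict[str, str | dict[str, float]]) – A dictionary of the first known profile

  • profile_2 (dict[str, str | dict[str, float]]) – A dictionary of the second known profile

Returns: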

A language

Return type:

str | None

In case of corrupt input arguments, None is returned
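A sketch of choosing between two known profiles by picking the one at the smaller distance. The MSE-based helper _distance, the alphabetical tie-break, and the name detect_language_sketch are assumptions for illustration only.

```python
def _distance(freq_a, freq_b):
    """Hypothetical helper: MSE over the union of symbols."""
    symbols = set(freq_a) | set(freq_b)
    if not symbols:
        return 0.0
    return sum((freq_a.get(s, 0.0) - freq_b.get(s, 0.0)) ** 2 for s in symbols) / len(symbols)


def detect_language_sketch(unknown_profile, profile_1, profile_2):
    """Return the name of the closer of two profiles, or None."""
    profiles = (unknown_profile, profile_1, profile_2)
    if not all(isinstance(p, dict) and "name" in p and "freq" in p for p in profiles):
        return None
    distance_1 = _distance(unknown_profile["freq"], profile_1["freq"])
    distance_2 = _distance(unknown_profile["freq"], profile_2["freq"])
    if distance_1 == distance_2:
        # Assumption: ties are broken alphabetically by language name.
        return min(profile_1["name"], profile_2["name"])
    return profile_1["name"] if distance_1 < distance_2 else profile_2["name"]


detected = detect_language_sketch(
    {"name": "unknown", "freq": {"a": 0.7, "b": 0.3}},
    {"name": "en", "freq": {"a": 0.6, "b": 0.4}},
    {"name": "de", "freq": {"a": 0.1, "b": 0.9}},
)
print(detected)
```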

lab_1_classify_by_unigrams.main.detect_language_advanced(unknown_profile: dict[str, str | dict[str, float]], known_profiles: list) → list | None

Detect the language of an unknown profile.

Parameters:
  • unknown_profile (dict[str, str | dict[str, float]]) – A dictionary of a profile to determine the language of

  • known_profiles (list) – A list of known profiles

Returns:

A sorted list of tuples containing a language and a distance

Return type:

list | None

In case of corrupt input arguments, None is returned
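A sketch of ranking any number of known profiles, returning (language, distance) tuples. Sorting by ascending distance and then by name, the MSE-based helper, and the name detect_language_advanced_sketch are assumptions.

```python
def _distance(freq_a, freq_b):
    """Hypothetical helper: MSE over the union of symbols."""
    symbols = set(freq_a) | set(freq_b)
    if not symbols:
        return 0.0
    return sum((freq_a.get(s, 0.0) - freq_b.get(s, 0.0)) ** 2 for s in symbols) / len(symbols)


def detect_language_advanced_sketch(unknown_profile, known_profiles):
    """Return (language, distance) tuples sorted from closest to farthest."""
    if not isinstance(unknown_profile, dict) or not isinstance(known_profiles, list):
        return None
    if "freq" not in unknown_profile:
        return None
    results = []
    for profile in known_profiles:
        if not isinstance(profile, dict) or "name" not in profile or "freq" not in profile:
            return None
        results.append((profile["name"], _distance(unknown_profile["freq"], profile["freq"])))
    # Assumption: closest first, language name as the tie-break.
    return sorted(results, key=lambda pair: (pair[1], pair[0]))


ranking = detect_language_advanced_sketch(
    {"name": "unknown", "freq": {"a": 0.7, "b": 0.3}},
    [
        {"name": "de", "freq": {"a": 0.1, "b": 0.9}},
        {"name": "en", "freq": {"a": 0.6, "b": 0.4}},
    ],
)
print(ranking)
```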

lab_1_classify_by_unigrams.main.load_profile(path_to_file: str) → dict | None

Load a language profile.

Parameters:

path_to_file (str) – A path to the language profile

Returns:

A dictionary with at least two keys – name, freq

Return type:

dict | None

In case of corrupt input arguments, None is returned
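A sketch of loading a single profile, assuming the file is JSON whose top level is an object. The file format is not stated here, and the name load_profile_sketch is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_profile_sketch(path_to_file):
    """Return the profile stored at the path as a dict, or None."""
    if not isinstance(path_to_file, str):
        return None
    # Assumption: the profile is a UTF-8 encoded JSON file.
    with open(path_to_file, encoding="utf-8") as file:
        profile = json.load(file)
    return profile if isinstance(profile, dict) else None


# Round-trip an illustrative profile through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "profile.json"
    path.write_text(json.dumps({"name": "en", "n_words": [10], "freq": {"a": 6, "b": 4}}),
                    encoding="utf-8")
    loaded = load_profile_sketch(str(path))
print(loaded)
```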

lab_1_classify_by_unigrams.main.preprocess_profile(profile: dict) → dict[str, str | dict] | None

Preprocess profile for a loaded language.

Parameters:

profile (dict) – A loaded profile

Returns:

A dict with a lower-cased loaded profile with relative frequencies and without unnecessary n-grams

Return type:

dict[str, str | dict] | None

In case of corrupt input arguments or lack of keys ‘name’, ‘n_words’ and ‘freq’ in arguments, None is returned
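A sketch of the preprocessing described above, under several assumptions: 'freq' holds raw counts, "unnecessary n-grams" means keys longer than one character, and relative frequencies are obtained by dividing by the sum of the kept counts. How 'n_words' is actually used is not stated here, so the sketch only checks that the key is present. The name preprocess_profile_sketch is hypothetical.

```python
def preprocess_profile_sketch(profile):
    """Return a profile with lower-cased unigrams and relative frequencies."""
    required = ("name", "n_words", "freq")
    if not isinstance(profile, dict) or any(key not in profile for key in required):
        return None
    merged = {}
    for ngram, count in profile["freq"].items():
        # Assumption: only single-character n-grams are kept.
        if len(ngram) != 1:
            continue
        key = ngram.lower()
        # Counts for 'A' and 'a' are merged after lower-casing.
        merged[key] = merged.get(key, 0) + count
    total = sum(merged.values())
    freq = {key: count / total for key, count in merged.items()} if total else {}
    return {"name": profile["name"], "freq": freq}


raw = {"name": "en", "n_words": [10], "freq": {"A": 2, "a": 4, "b": 4, "th": 7}}
processed = preprocess_profile_sketch(raw)
print(processed)
```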

lab_1_classify_by_unigrams.main.print_report(detections: list[tuple[str, float]]) → None

Print report for detection of language.

Parameters:

detections (list[tuple[str, float]]) – A list with distances for each available language

In case of corrupt input arguments, None is returned
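A sketch of printing one line per detection. The output format (language, then the distance to five decimal places) is an assumption, as are the names format_report_sketch and print_report_sketch.

```python
def format_report_sketch(detections):
    """Hypothetical helper: build one report line per (language, distance) pair."""
    return [f"{language}: MSE {distance:.5f}" for language, distance in detections]


def print_report_sketch(detections):
    """Print the report; nothing is printed for invalid input."""
    if not isinstance(detections, list):
        return None
    for line in format_report_sketch(detections):
        print(line)
    return None


print_report_sketch([("en", 0.01), ("de", 0.36)])
```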

lab_1_classify_by_unigrams.main.tokenize(text: str) → list[str] | None

Split a text into tokens.

Convert the tokens to lowercase and remove punctuation, digits and other symbols.

Parameters:

text (str) – A text

Returns:

A list of lower-cased tokens without punctuation

Return type:

list[str] | None

In case of corrupt input arguments, None is returned
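A sketch of such tokenization, assuming a token is a single alphabetic character (a unigram), so whitespace, digits and punctuation are all dropped. The name tokenize_sketch is hypothetical.

```python
def tokenize_sketch(text):
    """Return lower-cased alphabetic characters of the text, or None."""
    if not isinstance(text, str):
        return None
    # Assumption: each token is one letter; everything else is discarded.
    return [ch for ch in text.lower() if ch.isalpha()]


tokens = tokenize_sketch("Hey! 2 you.")
print(tokens)
```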