lab_1_classify_by_unigrams package

Submodules

Lab 1.

Language detection

lab_1_classify_by_unigrams.main.calculate_frequencies(tokens: list[str] | None) → dict[str, float] | None

Calculate frequencies of given tokens.

Parameters: tokens (list[str] | None) – A list of tokens
Returns: A dictionary with frequencies
Return type: dict[str, float] | None

In case of corrupt input arguments, None is returned.
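
A minimal sketch of how such a calculation could look, not the package's own code. It assumes that "frequency" means relative frequency (the count of a token divided by the total number of tokens) and that any non-list input or non-string element counts as corrupt. The name calculate_frequencies_sketch is hypothetical.

```python
from collections import Counter


def calculate_frequencies_sketch(tokens):
    """Return the relative frequency of each token, or None for bad input."""
    # Assumption: anything other than a list of strings is "corrupt".
    if not isinstance(tokens, list):
        return None
    if not all(isinstance(token, str) for token in tokens):
        return None
    total = len(tokens)
    # An empty list yields an empty dict; no division happens in that case.
    return {token: count / total for token, count in Counter(tokens).items()}


print(calculate_frequencies_sketch(["a", "b", "a", "c"]))
print(calculate_frequencies_sketch(None))
```

Under these assumptions the frequencies of any non-empty token list sum to 1.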

lab_1_classify_by_unigrams.main.calculate_mse(predicted: list, actual: list) → float | None

Calculate mean squared error between predicted and actual values.

Parameters:

predicted (list) – A list of predicted values
actual (list) – A list of actual values

Returns:

The score

Return type:

float | None

In case of corrupt input arguments, None is returned.
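
Mean squared error is the average of the squared differences between paired values. The sketch below illustrates that definition. Treating empty lists or lists of different lengths as corrupt input is an assumption, and calculate_mse_sketch is a hypothetical name rather than the package's implementation.

```python
def calculate_mse_sketch(predicted, actual):
    """Return the mean squared error of two equally long lists, or None."""
    if not isinstance(predicted, list) or not isinstance(actual, list):
        return None
    # Assumption: empty lists or lists of different lengths are "corrupt".
    if not predicted or len(predicted) != len(actual):
        return None
    squared_errors = [(a - p) ** 2 for p, a in zip(predicted, actual)]
    return sum(squared_errors) / len(squared_errors)


print(calculate_mse_sketch([0.5, 0.25], [0.0, 0.75]))
print(calculate_mse_sketch([1.0], [1.0, 2.0]))
```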

lab_1_classify_by_unigrams.main.collect_profiles(paths_to_profiles: list) → list[dict[str, str | dict[str, float]]] | None

Collect profiles from the given paths.

Parameters: paths_to_profiles (list) – A list of paths (strings) to the profiles
Returns: A list of loaded profiles
Return type: list[dict[str, str | dict[str, float]]] | None

In case of corrupt input arguments, None is returned.

lab_1_classify_by_unigrams.main.compare_profiles(unknown_profile: dict[str, str | dict[str, float]], profile_to_compare: dict[str, str | dict[str, float]]) → float | None

Compare profiles and calculate the distance using symbols.

Parameters:

unknown_profile (dict[str, str | dict[str, float]]) – A dictionary of an unknown profile
profile_to_compare (dict[str, str | dict[str, float]]) – A dictionary of a profile to compare the unknown profile to

Returns:

The distance between the profiles

Return type:

float | None

In case of corrupt input arguments, or if the arguments lack the keys ‘name’ and ‘freq’, None is returned.
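
One way such a comparison could work is sketched below. It assumes the distance is the mean squared error between the two frequency tables, aligned on the union of their symbols, with a symbol missing from one profile counted as frequency 0. The source does not confirm this, and compare_profiles_sketch is a hypothetical name.

```python
def compare_profiles_sketch(unknown_profile, profile_to_compare):
    """Return an MSE-based distance between two profiles, or None."""
    for profile in (unknown_profile, profile_to_compare):
        if not isinstance(profile, dict):
            return None
        if "name" not in profile or "freq" not in profile:
            return None
    unknown_freq = unknown_profile["freq"]
    known_freq = profile_to_compare["freq"]
    # Align both profiles on every symbol that occurs in either of them.
    symbols = sorted(set(unknown_freq) | set(known_freq))
    if not symbols:
        return None  # Assumption: two empty profiles cannot be compared.
    squared_errors = [
        (known_freq.get(symbol, 0.0) - unknown_freq.get(symbol, 0.0)) ** 2
        for symbol in symbols
    ]
    return sum(squared_errors) / len(symbols)


unknown = {"name": "unknown", "freq": {"a": 0.5, "b": 0.5}}
known = {"name": "en", "freq": {"a": 1.0}}
print(compare_profiles_sketch(unknown, known))
```

A smaller value means the two profiles are more alike; identical frequency tables give 0.0.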

lab_1_classify_by_unigrams.main.create_language_profile(language: str, text: str) → dict[str, str | dict[str, float]] | None

Create a language profile.

Parameters:

language (str) – A language
text (str) – A text

Returns:

A dictionary with two keys – name, freq

Return type:

dict[str, str | dict[str, float]] | None

In case of corrupt input arguments, None is returned.
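
The shape of the returned dictionary can be illustrated with a short sketch. It assumes that tokens are single lower-cased letters and that ‘freq’ holds relative frequencies. Both are assumptions, and create_language_profile_sketch is a hypothetical name.

```python
from collections import Counter


def create_language_profile_sketch(language, text):
    """Build a {'name': ..., 'freq': ...} profile, or return None."""
    if not isinstance(language, str) or not isinstance(text, str):
        return None
    # Assumption: a token is a single lower-cased alphabetic character.
    tokens = [char for char in text.lower() if char.isalpha()]
    total = len(tokens)
    freq = {token: count / total for token, count in Counter(tokens).items()}
    return {"name": language, "freq": freq}


print(create_language_profile_sketch("en", "Ab ba"))
```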

lab_1_classify_by_unigrams.main.detect_language(unknown_profile: dict[str, str | dict[str, float]], profile_1: dict[str, str | dict[str, float]], profile_2: dict[str, str | dict[str, float]]) → str | None

Detect the language of an unknown profile.

Parameters:

unknown_profile (dict[str, str | dict[str, float]]) – A dictionary of a profile to determine the language of
profile_1 (dict[str, str | dict[str, float]]) – A dictionary of a known profile
profile_2 (dict[str, str | dict[str, float]]) – A dictionary of a known profile

Returns:

A language

Return type:

str | None

In case of corrupt input arguments, None is returned.

lab_1_classify_by_unigrams.main.detect_language_advanced(unknown_profile: dict[str, str | dict[str, float]], known_profiles: list) → list | None

Detect the language of an unknown profile by comparing it against a list of known profiles.

Parameters:

unknown_profile (dict[str, str | dict[str, float]]) – A dictionary of a profile to determine the language of
known_profiles (list) – A list of known profiles

Returns:

A sorted list of tuples containing a language and a distance

Return type:

list | None

In case of corrupt input arguments, None is returned.
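
The sorted list of (language, distance) tuples might be produced along the following lines. This sketch assumes an MSE-based distance over the union of symbols and ascending order (closest language first, ties broken by name). The helper mse_distance and the name detect_language_advanced_sketch are hypothetical.

```python
def mse_distance(freq_a, freq_b):
    """Hypothetical helper: MSE over the union of symbols (missing = 0)."""
    symbols = set(freq_a) | set(freq_b)
    if not symbols:
        return 0.0
    total = sum((freq_a.get(s, 0.0) - freq_b.get(s, 0.0)) ** 2 for s in symbols)
    return total / len(symbols)


def detect_language_advanced_sketch(unknown_profile, known_profiles):
    """Return (language, distance) tuples sorted by distance, or None."""
    if not isinstance(unknown_profile, dict) or not isinstance(known_profiles, list):
        return None
    detections = [
        (profile["name"], mse_distance(unknown_profile["freq"], profile["freq"]))
        for profile in known_profiles
    ]
    # Assumption: closest language first; ties are ordered by language name.
    return sorted(detections, key=lambda pair: (pair[1], pair[0]))


unknown = {"name": "unknown", "freq": {"a": 0.5, "b": 0.5}}
candidates = [
    {"name": "de", "freq": {"a": 1.0}},
    {"name": "en", "freq": {"a": 0.5, "b": 0.5}},
]
print(detect_language_advanced_sketch(unknown, candidates))
```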

lab_1_classify_by_unigrams.main.load_profile(path_to_file: str) → dict | None

Load a language profile.

Parameters: path_to_file (str) – A path to the language profile
Returns: A dictionary with at least two keys – name, freq
Return type: dict | None

In case of corrupt input arguments, None is returned.
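
The source does not state the file format of a language profile. The sketch below assumes a UTF-8 JSON file whose top level is an object; load_profile_sketch and the demo file en.json are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_profile_sketch(path_to_file):
    """Load a profile from a JSON file (assumed format), or return None."""
    if not isinstance(path_to_file, str):
        return None
    with open(path_to_file, encoding="utf-8") as file:
        profile = json.load(file)
    # Assumption: a profile whose top level is not a dict is "corrupt".
    return profile if isinstance(profile, dict) else None


# Write a small demo profile to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = str(Path(tmp_dir) / "en.json")
    Path(demo_path).write_text(
        json.dumps({"name": "en", "freq": {"a": 1}}), encoding="utf-8"
    )
    loaded = load_profile_sketch(demo_path)

print(loaded)
```

This sketch does not handle a missing file or malformed JSON; open and json.load raise their usual exceptions in those cases.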

lab_1_classify_by_unigrams.main.preprocess_profile(profile: dict) → dict[str, str | dict] | None

Preprocess profile for a loaded language.

Parameters:

profile (dict) – A loaded profile

Returns:

A dictionary with the lower-cased loaded profile, containing relative frequencies and no unnecessary n-grams

Return type:

dict[str, str | dict] | None

In case of corrupt input arguments, or if the profile lacks the keys ‘name’, ‘n_words’ and ‘freq’, None is returned.

lab_1_classify_by_unigrams.main.print_report(detections: list[tuple[str, float]]) → None

Print report for detection of language.

Parameters: detections (list[tuple[str, float]]) – A list with distances for each available language

In case of corrupt input arguments, None is returned.
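
The report layout is not specified in the source. As an illustration only, the sketch below prints one line per language with the distance rounded to five decimal places. The line format, the helper format_report_lines and the name print_report_sketch are all assumptions.

```python
def format_report_lines(detections):
    """Hypothetical helper: one 'language: MSE distance' line per detection."""
    return [f"{language}: MSE {distance:.5f}" for language, distance in detections]


def print_report_sketch(detections):
    """Print the report lines; return None without printing on bad input."""
    if not isinstance(detections, list):
        return None
    for line in format_report_lines(detections):
        print(line)
    return None


print_report_sketch([("en", 0.5), ("de", 0.75)])
```

Keeping the formatting in a separate helper makes the output easy to check without capturing standard output.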

lab_1_classify_by_unigrams.main.tokenize(text: str) → list[str] | None

Split a text into tokens.

Convert the tokens to lowercase and remove punctuation, digits and other symbols.

Parameters: text (str) – A text
Returns: A list of lower-cased tokens without punctuation
Return type: list[str] | None

In case of corrupt input arguments, None is returned.
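
A minimal sketch of this kind of tokenization. The package name suggests unigram tokens, so the sketch assumes that each token is a single letter and that everything non-alphabetic (whitespace included) is dropped. This is an assumption rather than the package's documented behaviour, and tokenize_sketch is a hypothetical name.

```python
def tokenize_sketch(text):
    """Return lower-cased alphabetic characters of the text, or None."""
    if not isinstance(text, str):
        return None
    # Assumption: a token is one letter; digits, punctuation, spaces are removed.
    return [char for char in text.lower() if char.isalpha()]


print(tokenize_sketch("Hi, 2 you!"))
print(tokenize_sketch(42))
```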