lab_1_classify_by_unigrams package
Submodules
Lab 1.
Language detection
- lab_1_classify_by_unigrams.main.calculate_frequencies(tokens: list[str] | None) → dict[str, float] | None
Calculate frequencies of given tokens.
- Parameters:
tokens (list[str] | None) – A list of tokens
- Returns:
A dictionary with frequencies
- Return type:
dict[str, float] | None
In case of corrupt input arguments, None is returned
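The behaviour described for calculate_frequencies can be illustrated with a minimal sketch. It assumes that a frequency is a token's count divided by the total number of tokens, and that any input other than a list of strings counts as corrupt; this reference states neither detail. The name `sketch_calculate_frequencies` is hypothetical and is not the package's implementation.

```python
from collections import Counter


def sketch_calculate_frequencies(tokens):
    """Hypothetical stand-in: map each token to its share of all tokens."""
    # Assumption: anything other than a list of strings is "corrupt" input.
    if not isinstance(tokens, list) or not all(isinstance(t, str) for t in tokens):
        return None
    # Assumption: an empty list yields an empty dictionary.
    total = len(tokens)
    return {token: count / total for token, count in Counter(tokens).items()}


print(sketch_calculate_frequencies(["a", "b", "a", "c"]))
# prints {'a': 0.5, 'b': 0.25, 'c': 0.25}
```

Under these assumptions the returned values always sum to 1 for a non-empty token list.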
- lab_1_classify_by_unigrams.main.calculate_mse(predicted: list, actual: list) → float | None
Calculate mean squared error between predicted and actual values.
- Parameters:
predicted (list) – A list of predicted values
actual (list) – A list of actual values
- Returns:
The score
- Return type:
float | None
In case of corrupt input arguments, None is returned
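Mean squared error is the average of the squared differences between paired values. The sketch below shows that computation under stated assumptions: non-list arguments, empty lists and lists of different lengths are treated as corrupt, which this reference does not spell out. `sketch_calculate_mse` is a hypothetical name.

```python
def sketch_calculate_mse(predicted, actual):
    """Hypothetical stand-in: average of squared element-wise differences."""
    # Assumption: non-lists, empty lists and length mismatches are "corrupt".
    if not isinstance(predicted, list) or not isinstance(actual, list):
        return None
    if not predicted or len(predicted) != len(actual):
        return None
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sum(squared_errors) / len(squared_errors)


print(sketch_calculate_mse([0.5, 0.5], [0.0, 1.0]))
# prints 0.25
```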
- lab_1_classify_by_unigrams.main.collect_profiles(paths_to_profiles: list) → list[dict[str, str | dict[str, float]]] | None
Collect profiles from the given paths.
- Parameters:
paths_to_profiles (list) – A list of paths (strings) to the profiles
- Returns:
A list of loaded profiles
- Return type:
list[dict[str, str | dict[str, float]]] | None
In case of corrupt input arguments, None is returned
- lab_1_classify_by_unigrams.main.compare_profiles(unknown_profile: dict[str, str | dict[str, float]], profile_to_compare: dict[str, str | dict[str, float]]) → float | None
Compare profiles and calculate the distance using symbols.
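- Parameters:
unknown_profile (dict[str, str | dict[str, float]]) – A profile of an unknown language
profile_to_compare (dict[str, str | dict[str, float]]) – A profile to compare the unknown profile with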
- Returns:
The distance between the profiles
- Return type:
float | None
In case of corrupt input arguments or lack of keys ‘name’ and ‘freq’ in arguments, None is returned
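One plausible reading of "distance using symbols" is the mean squared error between the two frequency tables, taken over every symbol that occurs in either profile. The sketch below follows that reading; the choice of metric, the use of the union of symbols, and the treatment of a missing symbol as frequency 0.0 are all assumptions, and `sketch_compare_profiles` is a hypothetical name.

```python
def sketch_compare_profiles(unknown_profile, profile_to_compare):
    """Hypothetical stand-in: MSE between two 'freq' tables."""
    # Both profiles must be dictionaries carrying the 'name' and 'freq' keys.
    for profile in (unknown_profile, profile_to_compare):
        if not isinstance(profile, dict) or "name" not in profile or "freq" not in profile:
            return None
    unknown_freq = unknown_profile["freq"]
    known_freq = profile_to_compare["freq"]
    # Assumption: compare over the union of symbols, missing symbols count as 0.0.
    symbols = sorted(set(unknown_freq) | set(known_freq))
    if not symbols:
        return None
    squared = [(unknown_freq.get(s, 0.0) - known_freq.get(s, 0.0)) ** 2 for s in symbols]
    return sum(squared) / len(symbols)


unknown = {"name": "unknown", "freq": {"a": 0.5, "b": 0.5}}
known = {"name": "en", "freq": {"a": 0.5, "c": 0.5}}
print(sketch_compare_profiles(unknown, known))
```

In the usage example the symbols are a, b and c, so the squared differences 0, 0.25 and 0.25 are averaged over three symbols.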
- lab_1_classify_by_unigrams.main.create_language_profile(language: str, text: str) → dict[str, str | dict[str, float]] | None
Create a language profile.
- Parameters:
language (str) – A language
text (str) – A text
- Returns:
A dictionary with two keys – name, freq
- Return type:
dict[str, str | dict[str, float]] | None
In case of corrupt input arguments, None is returned
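To show the shape of the returned dictionary, here is a minimal sketch that builds a profile in one pass. It assumes that tokens are single lower-cased letters and that frequencies are relative counts; both are guesses based on the package name and the other entries. `sketch_create_language_profile` is a hypothetical name.

```python
from collections import Counter


def sketch_create_language_profile(language, text):
    """Hypothetical stand-in: build {'name': ..., 'freq': ...} from raw text."""
    if not isinstance(language, str) or not isinstance(text, str):
        return None
    # Assumption: tokens are individual letters, lower-cased, nothing else kept.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    freq = {ch: count / total for ch, count in Counter(letters).items()}
    return {"name": language, "freq": freq}


profile = sketch_create_language_profile("en", "Aa b!")
print(profile["name"], sorted(profile["freq"]))
# prints en ['a', 'b']
```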
- lab_1_classify_by_unigrams.main.detect_language(unknown_profile: dict[str, str | dict[str, float]], profile_1: dict[str, str | dict[str, float]], profile_2: dict[str, str | dict[str, float]]) → str | None
Detect the language of an unknown profile.
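- Parameters:
unknown_profile (dict[str, str | dict[str, float]]) – A profile of an unknown language
profile_1 (dict[str, str | dict[str, float]]) – The first known profile to compare with
profile_2 (dict[str, str | dict[str, float]]) – The second known profile to compare with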
- Returns:
A language
- Return type:
str | None
In case of corrupt input arguments, None is returned
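A two-way detection can be sketched as "compute both distances, return the name of the closer profile". The version below assumes an MSE distance over the union of symbols and lets the first profile win a tie; the reference specifies neither the metric nor the tie rule. Both function names are hypothetical.

```python
def _sketch_distance(freq_a, freq_b):
    """MSE over the union of symbols; absent symbols count as 0.0 (assumption)."""
    symbols = set(freq_a) | set(freq_b)
    if not symbols:
        return 0.0
    return sum((freq_a.get(s, 0.0) - freq_b.get(s, 0.0)) ** 2 for s in symbols) / len(symbols)


def sketch_detect_language(unknown_profile, profile_1, profile_2):
    """Hypothetical stand-in: name of the known profile closest to the unknown one."""
    profiles = (unknown_profile, profile_1, profile_2)
    if not all(isinstance(p, dict) and "name" in p and "freq" in p for p in profiles):
        return None
    distance_1 = _sketch_distance(unknown_profile["freq"], profile_1["freq"])
    distance_2 = _sketch_distance(unknown_profile["freq"], profile_2["freq"])
    # Assumption: on a tie the first profile wins.
    return profile_1["name"] if distance_1 <= distance_2 else profile_2["name"]


unknown = {"name": "unknown", "freq": {"a": 0.6, "b": 0.4}}
english = {"name": "en", "freq": {"a": 0.5, "b": 0.5}}
german = {"name": "de", "freq": {"a": 0.1, "b": 0.9}}
print(sketch_detect_language(unknown, english, german))
# prints en
```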
- lab_1_classify_by_unigrams.main.detect_language_advanced(unknown_profile: dict[str, str | dict[str, float]], known_profiles: list) → list | None
Detect the language of an unknown profile.
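- Parameters:
unknown_profile (dict[str, str | dict[str, float]]) – A profile of an unknown language
known_profiles (list) – A list of known profiles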
- Returns:
A sorted list of tuples containing a language and a distance
- Return type:
list | None
In case of corrupt input arguments, None is returned
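The "sorted list of tuples" can be pictured as (language, distance) pairs ordered from the closest profile to the farthest. The sketch below assumes an MSE distance, ascending order by distance, and alphabetical order on ties; the reference only says that the list is sorted. `sketch_detect_language_advanced` is a hypothetical name.

```python
def sketch_detect_language_advanced(unknown_profile, known_profiles):
    """Hypothetical stand-in: rank known profiles by distance to the unknown one."""
    if not isinstance(unknown_profile, dict) or "freq" not in unknown_profile:
        return None
    if not isinstance(known_profiles, list):
        return None
    unknown_freq = unknown_profile["freq"]
    detections = []
    for profile in known_profiles:
        known_freq = profile["freq"]
        # Assumption: MSE over the union of symbols, missing symbols count as 0.0.
        symbols = set(unknown_freq) | set(known_freq)
        squared = [(unknown_freq.get(s, 0.0) - known_freq.get(s, 0.0)) ** 2 for s in symbols]
        distance = sum(squared) / len(symbols) if symbols else 0.0
        detections.append((profile["name"], distance))
    # Assumption: closest language first, ties broken by language name.
    return sorted(detections, key=lambda pair: (pair[1], pair[0]))


unknown = {"name": "unknown", "freq": {"a": 0.6, "b": 0.4}}
known = [
    {"name": "de", "freq": {"a": 0.1, "b": 0.9}},
    {"name": "en", "freq": {"a": 0.5, "b": 0.5}},
]
ranking = sketch_detect_language_advanced(unknown, known)
print([language for language, _ in ranking])
# prints ['en', 'de']
```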
- lab_1_classify_by_unigrams.main.load_profile(path_to_file: str) → dict | None
Load a language profile.
- Parameters:
path_to_file (str) – A path to the language profile
- Returns:
A dictionary with at least two keys – name, freq
- Return type:
dict | None
In case of corrupt input arguments, None is returned
- lab_1_classify_by_unigrams.main.preprocess_profile(profile: dict) → dict[str, str | dict] | None
Preprocess profile for a loaded language.
- Parameters:
profile (dict) – A loaded profile
- Returns:
A dict with a lower-cased loaded profile with relative frequencies and without unnecessary n-grams
- Return type:
dict[str, str | dict] | None
In case of corrupt input arguments or lack of keys ‘name’, ‘n_words’ and ‘freq’ in arguments, None is returned
- lab_1_classify_by_unigrams.main.print_report(detections: list[tuple[str, float]]) → None
Print report for detection of language.
- Parameters:
detections (list[tuple[str, float]]) – A list with distances for each available language
In case of corrupt input arguments, None is returned
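The report layout is not described in this reference, so the following is only a guess at what printing the detections might look like: one line per language, with the distance rounded to five decimal places. Both the line format and the name `sketch_print_report` are assumptions.

```python
def sketch_print_report(detections):
    """Hypothetical stand-in: print one line per (language, distance) pair."""
    if not isinstance(detections, list):
        return None
    for language, distance in detections:
        # Assumption: the line format below is illustrative, not the package's own.
        print(f"{language}: MSE {distance:.5f}")
    return None


sketch_print_report([("en", 0.01), ("de", 0.25)])
# prints:
# en: MSE 0.01000
# de: MSE 0.25000
```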
- lab_1_classify_by_unigrams.main.tokenize(text: str) → list[str] | None
Split a text into tokens.
Convert the tokens to lowercase and remove punctuation, digits and other symbols.
- Parameters:
text (str) – A text
- Returns:
A list of lower-cased tokens without punctuation
- Return type:
list[str] | None
In case of corrupt input arguments, None is returned
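A minimal sketch of the tokenization described above, assuming that a token is a single letter (suggested by the unigram focus of the package, but not stated here) and that "other symbols" covers everything `str.isalpha` rejects, including whitespace. `sketch_tokenize` is a hypothetical name.

```python
def sketch_tokenize(text):
    """Hypothetical stand-in: lower-cased letters of the text, in order."""
    if not isinstance(text, str):
        return None
    # Assumption: a token is one letter; digits, punctuation and spaces are dropped.
    return [ch for ch in text.lower() if ch.isalpha()]


print(sketch_tokenize("Hey! 42 times"))
# prints ['h', 'e', 'y', 't', 'i', 'm', 'e', 's']
```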