Submodules
Lab 1
Extract keywords based on frequency-related metrics
-
lab_1_keywords_tfidf.main.calculate_chi_values(expected: dict[str, float], observed: dict[str, int]) → dict[str, float] | None
Calculate chi-squared values for tokens.
- Parameters:
expected (dict[str, float]) – Expected token frequencies
observed (dict[str, int]) – Observed token frequencies
- Returns:
Dictionary with chi-squared values.
In case of corrupt input arguments, None is returned.
- Return type:
dict[str, float] | None
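A minimal sketch of what this computation might look like. The function name mirrors the documented signature; the formula is an assumption, using the standard per-token Pearson statistic (observed − expected)² / expected:

```python
def calculate_chi_values(expected, observed):
    """Chi-squared value per token: (O - E)^2 / E."""
    if not expected or not observed:
        return None  # corrupt input arguments
    return {
        token: (observed.get(token, 0) - expected[token]) ** 2 / expected[token]
        for token in expected
    }
```

For example, a token expected 2.0 times but observed 4 times yields (4 − 2)² / 2 = 2.0.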
-
lab_1_keywords_tfidf.main.calculate_expected_frequency(doc_freqs: dict[str, int], corpus_freqs: dict[str, int]) → dict[str, float] | None
Calculate expected frequency for tokens based on document and corpus frequencies.
- Parameters:
doc_freqs (dict[str, int]) – Token frequencies in document
corpus_freqs (dict[str, int]) – Token frequencies in corpus
- Returns:
Dictionary with expected frequencies.
In case of corrupt input arguments, None is returned.
- Return type:
dict[str, float] | None
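One common formulation, sketched below under the assumption that expected frequencies come from a 2×2 contingency table (token vs. all other tokens, document vs. corpus); the function name mirrors the documented signature:

```python
def calculate_expected_frequency(doc_freqs, corpus_freqs):
    """Expected frequency per token from a 2x2 contingency table."""
    if not doc_freqs:
        return None  # corrupt input arguments
    doc_total = sum(doc_freqs.values())
    corpus_total = sum(corpus_freqs.values())
    expected = {}
    for token, j in doc_freqs.items():
        k = corpus_freqs.get(token, 0)   # token occurrences in the corpus
        l = doc_total - j                # other tokens in the document
        m = corpus_total - k             # other tokens in the corpus
        expected[token] = (j + k) * (j + l) / (j + k + l + m)
    return expected
```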
-
lab_1_keywords_tfidf.main.calculate_frequencies(tokens: list[str]) → dict[str, int] | None
Create a frequency dictionary from the token sequence.
- Parameters:
tokens (list[str]) – Token sequence
- Returns:
A dictionary {token: occurrences}.
In case of corrupt input arguments, None is returned.
- Return type:
dict[str, int] | None
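A sketch of the counting step, assuming a plain occurrence count per token (the function name mirrors the documented signature):

```python
from collections import Counter

def calculate_frequencies(tokens):
    """Map each token to its number of occurrences."""
    if not isinstance(tokens, list):
        return None  # corrupt input arguments
    return dict(Counter(tokens))
```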
-
lab_1_keywords_tfidf.main.calculate_tf(frequencies: dict[str, int]) → dict[str, float] | None
Calculate Term Frequency (TF) for each token.
- Parameters:
frequencies (dict[str, int]) – Raw occurrences of tokens
- Returns:
Dictionary with tokens and TF values.
In case of corrupt input arguments, None is returned.
- Return type:
dict[str, float] | None
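A sketch of the TF step, assuming the usual definition tf(t) = count(t) / total tokens (the function name mirrors the documented signature):

```python
def calculate_tf(frequencies):
    """Term Frequency: raw count divided by total token count."""
    total = sum(frequencies.values()) if frequencies else 0
    if not total:
        return None  # corrupt or empty input
    return {token: count / total for token, count in frequencies.items()}
```

For instance, counts {"a": 1, "b": 3} give TF values 0.25 and 0.75.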
-
lab_1_keywords_tfidf.main.calculate_tfidf(term_freq: dict[str, float], idf: dict[str, float]) → dict[str, float] | None
Calculate TF-IDF score for tokens.
- Parameters:
-
- Returns:
Dictionary with tokens and TF-IDF values.
In case of corrupt input arguments, None is returned.
- Return type:
dict[str, float] | None
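A sketch assuming the usual product tf-idf(t) = tf(t) · idf(t); how tokens missing from the IDF dictionary are handled is not specified here, so this sketch falls back to 0.0 for them (the function name mirrors the documented signature):

```python
def calculate_tfidf(term_freq, idf):
    """TF-IDF score: product of TF and IDF per token."""
    if not term_freq:
        return None  # corrupt input arguments
    # Tokens absent from idf fall back to 0.0 (a simplifying assumption)
    return {token: tf * idf.get(token, 0.0) for token, tf in term_freq.items()}
```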
-
lab_1_keywords_tfidf.main.check_dict(user_input: Any, key_type: type, value_type: type, can_be_empty: bool) → bool
Check if the object is a dictionary with keys and values of given types.
- Parameters:
user_input (Any) – Object to check
key_type (type) – Expected type of dictionary keys
value_type (type) – Expected type of dictionary values
can_be_empty (bool) – Whether an empty dictionary is allowed
- Returns:
True if valid, False otherwise
- Return type:
bool
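A sketch of such a validator, assuming a straightforward isinstance check over keys and values (the function name mirrors the documented signature):

```python
def check_dict(user_input, key_type, value_type, can_be_empty):
    """Validate a dict's type, key/value types, and emptiness."""
    if not isinstance(user_input, dict):
        return False
    if not user_input:
        return can_be_empty
    return all(
        isinstance(key, key_type) and isinstance(value, value_type)
        for key, value in user_input.items()
    )
```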
-
lab_1_keywords_tfidf.main.check_float(user_input: Any) → bool
Check if the object is a float.
- Parameters:
user_input (Any) – Object to check
- Returns:
True if valid, False otherwise
- Return type:
bool
-
lab_1_keywords_tfidf.main.check_list(user_input: Any, elements_type: type, can_be_empty: bool) → bool
Check if the object is a list containing elements of a certain type.
- Parameters:
user_input (Any) – Object to check
elements_type (type) – Expected type of list elements
can_be_empty (bool) – Whether an empty list is allowed
- Returns:
True if valid, False otherwise
- Return type:
bool
-
lab_1_keywords_tfidf.main.check_positive_int(user_input: Any) → bool
Check if the object is a positive integer (not bool).
- Parameters:
user_input (Any) – Object to check
- Returns:
True if valid, False otherwise
- Return type:
bool
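The "(not bool)" caveat matters because `bool` is a subclass of `int` in Python, so `isinstance(True, int)` is `True`. A sketch of the check (the function name mirrors the documented signature):

```python
def check_positive_int(user_input):
    """True only for a strictly positive int that is not a bool."""
    return (
        isinstance(user_input, int)
        and not isinstance(user_input, bool)  # bool subclasses int
        and user_input > 0
    )
```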
-
lab_1_keywords_tfidf.main.clean_and_tokenize(text: str) → list[str] | None
Remove punctuation, convert to lowercase, and split into tokens.
- Parameters:
text (str) – Original text
- Returns:
A list of lowercase tokens without punctuation.
In case of corrupt input arguments, None is returned.
- Return type:
list[str] | None
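A sketch of the three documented steps, assuming "punctuation" means the ASCII set in `string.punctuation` and that tokens are separated by whitespace (the function name mirrors the documented signature):

```python
import string

def clean_and_tokenize(text):
    """Lowercase, strip punctuation, split on whitespace."""
    if not isinstance(text, str):
        return None  # corrupt input arguments
    cleaned = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return cleaned.split()
```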
-
lab_1_keywords_tfidf.main.extract_significant_words(chi_values: dict[str, float], alpha: float) → dict[str, float] | None
Select tokens with chi-squared values greater than the critical threshold.
- Parameters:
chi_values (dict[str, float]) – Dictionary with chi-squared values
alpha (float) – Significance level controlling chi-squared threshold
- Returns:
Dictionary with significant tokens.
In case of corrupt input arguments, None is returned.
- Return type:
dict[str, float] | None
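A sketch of the selection step. The mapping from significance level to critical value is an assumption here: the table below lists standard chi-squared critical values for one degree of freedom, and unsupported alpha values are treated as corrupt input (the function name mirrors the documented signature):

```python
# Chi-squared critical values for df = 1 (assumed lookup table)
CRITERION = {0.05: 3.842, 0.01: 6.635, 0.001: 10.828}

def extract_significant_words(chi_values, alpha):
    """Keep tokens whose chi-squared value exceeds the critical threshold."""
    threshold = CRITERION.get(alpha)
    if threshold is None or not chi_values:
        return None  # corrupt input arguments
    return {token: value for token, value in chi_values.items() if value > threshold}
```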
-
lab_1_keywords_tfidf.main.get_top_n(frequencies: dict[str, int | float], top: int) → list[str] | None
Extract the most frequent tokens.
- Parameters:
frequencies (dict[str, int | float]) – A dictionary with tokens and their frequencies
top (int) – Number of tokens to extract
- Returns:
Top-N tokens sorted by frequency.
In case of corrupt input arguments, None is returned.
- Return type:
list[str] | None
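A sketch, assuming tokens are sorted by frequency in descending order and the first `top` are kept (the function name mirrors the documented signature):

```python
def get_top_n(frequencies, top):
    """Return the top-N tokens sorted by descending frequency."""
    if not frequencies or not isinstance(top, int) or top <= 0:
        return None  # corrupt input arguments
    return sorted(frequencies, key=lambda token: frequencies[token], reverse=True)[:top]
```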
-
lab_1_keywords_tfidf.main.remove_stop_words(tokens: list[str], stop_words: list[str]) → list[str] | None
Exclude stop words from the token sequence.
- Parameters:
tokens (list[str]) – Token sequence
stop_words (list[str]) – Stop words to exclude
- Returns:
Token sequence without stop words.
In case of corrupt input arguments, None is returned.
- Return type:
list[str] | None
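A sketch of the filtering step, assuming token order is preserved (the function name mirrors the documented signature):

```python
def remove_stop_words(tokens, stop_words):
    """Drop stop words from the token sequence, preserving order."""
    if not isinstance(tokens, list) or not isinstance(stop_words, list):
        return None  # corrupt input arguments
    stop_set = set(stop_words)  # O(1) membership tests
    return [token for token in tokens if token not in stop_set]
```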