lab_1_keywords_tfidf package

Submodules

Lab 1

Extract keywords based on frequency-related metrics.

lab_1_keywords_tfidf.main.calculate_chi_values(expected: dict[str, float], observed: dict[str, int]) → dict[str, float] | None

Calculate chi-squared values for tokens.

Parameters:
  • expected (dict[str, float]) – Expected token frequencies

  • observed (dict[str, int]) – Observed token frequencies

Returns:

Dictionary with chi-squared values. In case of corrupt input arguments, None is returned.

Return type:

dict[str, float] | None
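A minimal sketch of what this function might compute, assuming the standard Pearson statistic chi² = (observed − expected)² / expected per token; the exact validation rules are an assumption:

```python
def calculate_chi_values(expected, observed):
    """Chi-squared value per token: (observed - expected)^2 / expected."""
    if not isinstance(expected, dict) or not expected:
        return None
    if not isinstance(observed, dict) or not observed:
        return None
    # Tokens absent from `observed` are treated as having zero occurrences
    return {token: (observed.get(token, 0) - exp) ** 2 / exp
            for token, exp in expected.items()}
```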

lab_1_keywords_tfidf.main.calculate_expected_frequency(doc_freqs: dict[str, int], corpus_freqs: dict[str, int]) → dict[str, float] | None

Calculate expected frequency for tokens based on document and corpus frequencies.

Parameters:
  • doc_freqs (dict[str, int]) – Token frequencies in document

  • corpus_freqs (dict[str, int]) – Token frequencies in corpus

Returns:

Dictionary with expected frequencies. In case of corrupt input arguments, None is returned.

Return type:

dict[str, float] | None
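A possible sketch, assuming the usual contingency-table formula for expected frequency, E = (j + k)(j + l) / (j + k + l + m), where j and k are the token's counts in the document and corpus and l and m are the counts of all other tokens; the validation details are an assumption:

```python
def calculate_expected_frequency(doc_freqs, corpus_freqs):
    """Expected frequency per document token from a 2x2 contingency table."""
    if not isinstance(doc_freqs, dict) or not doc_freqs:
        return None
    if not isinstance(corpus_freqs, dict):
        return None
    doc_total = sum(doc_freqs.values())
    corpus_total = sum(corpus_freqs.values())
    expected = {}
    for token, j in doc_freqs.items():
        k = corpus_freqs.get(token, 0)   # token count in the corpus
        l = doc_total - j                # all other tokens in the document
        m = corpus_total - k             # all other tokens in the corpus
        expected[token] = (j + k) * (j + l) / (j + k + l + m)
    return expected
```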

lab_1_keywords_tfidf.main.calculate_frequencies(tokens: list[str]) → dict[str, int] | None

Create a frequency dictionary from the token sequence.

Parameters:

tokens (list[str]) – Token sequence

Returns:

A dictionary {token: occurrences}. In case of corrupt input arguments, None is returned.

Return type:

dict[str, int] | None
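A minimal sketch of the documented behavior; the exact input checks are an assumption:

```python
def calculate_frequencies(tokens):
    """Map each token to its number of occurrences in the sequence."""
    if not isinstance(tokens, list) or not tokens:
        return None
    if not all(isinstance(token, str) for token in tokens):
        return None
    freqs = {}
    for token in tokens:
        freqs[token] = freqs.get(token, 0) + 1
    return freqs
```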

lab_1_keywords_tfidf.main.calculate_tf(frequencies: dict[str, int]) → dict[str, float] | None

Calculate Term Frequency (TF) for each token.

Parameters:

frequencies (dict[str, int]) – Raw occurrences of tokens

Returns:

Dictionary with tokens and TF values. In case of corrupt input arguments, None is returned.

Return type:

dict[str, float] | None
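A minimal sketch, assuming the standard definition TF(t) = count(t) / total token count; the validation details are an assumption:

```python
def calculate_tf(frequencies):
    """Normalize raw counts into term frequencies that sum to 1."""
    if not isinstance(frequencies, dict) or not frequencies:
        return None
    total = sum(frequencies.values())
    return {token: count / total for token, count in frequencies.items()}
```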

lab_1_keywords_tfidf.main.calculate_tfidf(term_freq: dict[str, float], idf: dict[str, float]) → dict[str, float] | None

Calculate TF-IDF score for tokens.

Parameters:
  • term_freq (dict[str, float]) – Term frequency (TF) values for tokens

  • idf (dict[str, float]) – Inverse document frequency (IDF) values for tokens

Returns:

Dictionary with tokens and TF-IDF values. In case of corrupt input arguments, None is returned.

Return type:

dict[str, float] | None
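A possible sketch of the product TF · IDF per token; the fallback IDF of 0.0 for tokens missing from the idf mapping is an assumption, as are the validation rules:

```python
def calculate_tfidf(term_freq, idf):
    """Score each token as TF * IDF."""
    if not isinstance(term_freq, dict) or not term_freq:
        return None
    if not isinstance(idf, dict):
        return None
    # Assumed fallback: tokens without an IDF entry score 0.0
    return {token: tf * idf.get(token, 0.0) for token, tf in term_freq.items()}
```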

lab_1_keywords_tfidf.main.check_dict(user_input: Any, key_type: type, value_type: type, can_be_empty: bool) → bool

Check if the object is a dictionary with keys and values of given types.

Parameters:
  • user_input (Any) – Object to check

  • key_type (type) – Expected type of dictionary keys

  • value_type (type) – Expected type of dictionary values

  • can_be_empty (bool) – Whether an empty dictionary is allowed

Returns:

True if valid, False otherwise

Return type:

bool
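A minimal sketch of such a validator; the exact treatment of empty dictionaries follows the `can_be_empty` flag described above:

```python
def check_dict(user_input, key_type, value_type, can_be_empty):
    """True iff user_input is a dict with the expected key and value types."""
    if not isinstance(user_input, dict):
        return False
    if not user_input:
        return can_be_empty
    return all(isinstance(key, key_type) and isinstance(value, value_type)
               for key, value in user_input.items())
```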

lab_1_keywords_tfidf.main.check_float(user_input: Any) → bool

Check if the object is a float.

Parameters:

user_input (Any) – Object to check

Returns:

True if valid, False otherwise

Return type:

bool

lab_1_keywords_tfidf.main.check_list(user_input: Any, elements_type: type, can_be_empty: bool) → bool

Check if the object is a list containing elements of a certain type.

Parameters:
  • user_input (Any) – Object to check

  • elements_type (type) – Expected type of list elements

  • can_be_empty (bool) – Whether an empty list is allowed

Returns:

True if valid, False otherwise

Return type:

bool

lab_1_keywords_tfidf.main.check_positive_int(user_input: Any) → bool

Check if the object is a positive integer (not bool).

Parameters:

user_input (Any) – Object to check

Returns:

True if valid, False otherwise

Return type:

bool

lab_1_keywords_tfidf.main.clean_and_tokenize(text: str) → list[str] | None

Remove punctuation, convert to lowercase, and split into tokens.

Parameters:

text (str) – Original text

Returns:

A list of lowercase tokens without punctuation. In case of corrupt input arguments, None is returned.

Return type:

list[str] | None
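A minimal sketch, assuming "punctuation" means any character that is neither alphanumeric nor whitespace:

```python
def clean_and_tokenize(text):
    """Lowercase the text, drop punctuation, split on whitespace."""
    if not isinstance(text, str):
        return None
    cleaned = "".join(ch for ch in text.lower()
                      if ch.isalnum() or ch.isspace())
    return cleaned.split()
```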

lab_1_keywords_tfidf.main.extract_significant_words(chi_values: dict[str, float], alpha: float) → dict[str, float] | None

Select tokens with chi-squared values greater than the critical threshold.

Parameters:
  • chi_values (dict[str, float]) – Dictionary with chi-squared values

  • alpha (float) – Significance level controlling chi-squared threshold

Returns:

Dictionary with significant tokens. In case of corrupt input arguments, None is returned.

Return type:

dict[str, float] | None
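A possible sketch, assuming one degree of freedom and a small built-in table of chi-squared critical values; the set of supported alpha levels is an assumption:

```python
def extract_significant_words(chi_values, alpha):
    """Keep tokens whose chi-squared value exceeds the critical threshold."""
    # Critical values for df = 1; supported alpha levels are an assumption
    criterion = {0.05: 3.842, 0.01: 6.635, 0.001: 10.828}
    if not isinstance(chi_values, dict) or not chi_values:
        return None
    if alpha not in criterion:
        return None
    threshold = criterion[alpha]
    return {token: value for token, value in chi_values.items()
            if value > threshold}
```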

lab_1_keywords_tfidf.main.get_top_n(frequencies: dict[str, int | float], top: int) → list[str] | None

Extract the most frequent tokens.

Parameters:
  • frequencies (dict[str, int | float]) – A dictionary with tokens and their frequencies

  • top (int) – Number of tokens to extract

Returns:

Top-N tokens sorted by frequency. In case of corrupt input arguments, None is returned.

Return type:

list[str] | None
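A minimal sketch of the selection step; the tie-breaking order among equal frequencies is an assumption (Python's stable sort keeps insertion order):

```python
def get_top_n(frequencies, top):
    """Return the `top` tokens with the highest frequency, descending."""
    if not isinstance(frequencies, dict) or not frequencies:
        return None
    # A positive int is required; bool is excluded (it subclasses int)
    if not isinstance(top, int) or isinstance(top, bool) or top <= 0:
        return None
    return sorted(frequencies, key=lambda token: frequencies[token],
                  reverse=True)[:top]
```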

lab_1_keywords_tfidf.main.remove_stop_words(tokens: list[str], stop_words: list[str]) → list[str] | None

Exclude stop words from the token sequence.

Parameters:
  • tokens (list[str]) – Original token sequence

  • stop_words (list[str]) – Tokens to exclude

Returns:

Token sequence without stop words. In case of corrupt input arguments, None is returned.

Return type:

list[str] | None
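A minimal sketch of the filtering step; converting the stop-word list to a set for O(1) lookups is a design choice, and the validation rules are an assumption:

```python
def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word list."""
    if not isinstance(tokens, list) or not tokens:
        return None
    if not isinstance(stop_words, list):
        return None
    stop_set = set(stop_words)
    return [token for token in tokens if token not in stop_set]
```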