lab_1_keywords_tfidf package

Submodules

Lab 1

Extract keywords based on frequency-related metrics.

lab_1_keywords_tfidf.main.calculate_chi_values(expected: dict[str, float], observed: dict[str, int]) → dict[str, float] | None

Calculate chi-squared values for tokens.

Parameters:
  • expected (dict[str, float]) – Expected token frequencies

  • observed (dict[str, int]) – Observed token frequencies

Returns:

Dictionary with chi-squared values. In case of corrupt input arguments, None is returned.

Return type:

dict[str, float] | None
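A minimal sketch of what this function might compute, assuming the standard Pearson statistic chi² = (observed − expected)² / expected per token; the exact validation rules are an assumption:

```python
def calculate_chi_values(expected, observed):
    """Chi-squared value per token: (observed - expected)^2 / expected."""
    if not isinstance(expected, dict) or not expected:
        return None
    if not isinstance(observed, dict) or not observed:
        return None
    # Tokens absent from `observed` are treated as having zero occurrences
    return {token: (observed.get(token, 0) - exp) ** 2 / exp
            for token, exp in expected.items()}
```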

lab_1_keywords_tfidf.main.calculate_expected_frequency(doc_freqs: dict[str, int], corpus_freqs: dict[str, int]) → dict[str, float] | None

Calculate expected frequency for tokens based on document and corpus frequencies.

Parameters:
  • doc_freqs (dict[str, int]) – Token frequencies in document

  • corpus_freqs (dict[str, int]) – Token frequencies in corpus

Returns:

Dictionary with expected frequencies. In case of corrupt input arguments, None is returned.

Return type:

dict[str, float] | None
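A possible sketch, assuming the usual contingency-table formula for expected frequency, E = (j + k)(j + l) / (j + k + l + m), where j and k are the token's counts in the document and corpus and l and m are the counts of all other tokens; the validation details are an assumption:

```python
def calculate_expected_frequency(doc_freqs, corpus_freqs):
    """Expected frequency per document token from a 2x2 contingency table."""
    if not isinstance(doc_freqs, dict) or not doc_freqs:
        return None
    if not isinstance(corpus_freqs, dict):
        return None
    doc_total = sum(doc_freqs.values())
    corpus_total = sum(corpus_freqs.values())
    expected = {}
    for token, j in doc_freqs.items():
        k = corpus_freqs.get(token, 0)   # token count in the corpus
        l = doc_total - j                # all other tokens in the document
        m = corpus_total - k             # all other tokens in the corpus
        expected[token] = (j + k) * (j + l) / (j + k + l + m)
    return expected
```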

lab_1_keywords_tfidf.main.calculate_frequencies(tokens: list[str]) → dict[str, int] | None

Create a frequency dictionary from the token sequence.

Parameters:

tokens (list[str]) – Token sequence

Returns:

A dictionary {token: occurrences}. In case of corrupt input arguments, None is returned.

Return type:

dict[str, int] | None
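A minimal sketch of the documented behavior; the exact input checks are an assumption:

```python
def calculate_frequencies(tokens):
    """Map each token to its number of occurrences in the sequence."""
    if not isinstance(tokens, list) or not tokens:
        return None
    if not all(isinstance(token, str) for token in tokens):
        return None
    freqs = {}
    for token in tokens:
        freqs[token] = freqs.get(token, 0) + 1
    return freqs
```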

lab_1_keywords_tfidf.main.calculate_tf(frequencies: dict[str, int]) → dict[str, float] | None

Calculate Term Frequency (TF) for each token.

Parameters:

frequencies (dict[str, int]) – Raw occurrences of tokens

Returns:

Dictionary with tokens and TF values. In case of corrupt input arguments, None is returned.

Return type:

dict[str, float] | None
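A minimal sketch, assuming the standard definition TF(t) = count(t) / total token count; the validation details are an assumption:

```python
def calculate_tf(frequencies):
    """Normalize raw counts into term frequencies that sum to 1."""
    if not isinstance(frequencies, dict) or not frequencies:
        return None
    total = sum(frequencies.values())
    return {token: count / total for token, count in frequencies.items()}
```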

lab_1_keywords_tfidf.main.calculate_tfidf(term_freq: dict[str, float], idf: dict[str, float]) → dict[str, float] | None

Calculate TF-IDF score for tokens.

Parameters:
  • term_freq (dict[str, float]) – Term frequency (TF) values for tokens

  • idf (dict[str, float]) – Inverse document frequency (IDF) values for tokens

Returns:

Dictionary with tokens and TF-IDF values. In case of corrupt input arguments, None is returned.

Return type:

dict[str, float] | None
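A possible sketch of the product TF · IDF per token; the fallback IDF of 0.0 for tokens missing from the idf mapping is an assumption, as are the validation rules:

```python
def calculate_tfidf(term_freq, idf):
    """Score each token as TF * IDF."""
    if not isinstance(term_freq, dict) or not term_freq:
        return None
    if not isinstance(idf, dict):
        return None
    # Assumed fallback: tokens without an IDF entry score 0.0
    return {token: tf * idf.get(token, 0.0) for token, tf in term_freq.items()}
```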

lab_1_keywords_tfidf.main.check_dict(user_input: Any, key_type: type, value_type: type, can_be_empty: bool) → bool

Check if the object is a dictionary with keys and values of given types.

Parameters:
  • user_input (Any) – Object to check

  • key_type (type) – Expected type of dictionary keys

  • value_type (type) – Expected type of dictionary values

  • can_be_empty (bool) – Whether an empty dictionary is allowed

Returns:

True if valid, False otherwise

Return type:

bool
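A minimal sketch of such a validator; the exact treatment of empty dictionaries follows the `can_be_empty` flag described above:

```python
def check_dict(user_input, key_type, value_type, can_be_empty):
    """True iff user_input is a dict with the expected key and value types."""
    if not isinstance(user_input, dict):
        return False
    if not user_input:
        return can_be_empty
    return all(isinstance(key, key_type) and isinstance(value, value_type)
               for key, value in user_input.items())
```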

lab_1_keywords_tfidf.main.check_float(user_input: Any) → bool

Check if the object is a float.

Parameters:

user_input (Any) – Object to check

Returns:

True if valid, False otherwise

Return type:

bool

lab_1_keywords_tfidf.main.check_list(user_input: Any, elements_type: type, can_be_empty: bool) → bool

Check if the object is a list containing elements of a certain type.

Parameters:
  • user_input (Any) – Object to check

  • elements_type (type) – Expected type of list elements

  • can_be_empty (bool) – Whether an empty list is allowed

Returns:

True if valid, False otherwise

Return type:

bool

lab_1_keywords_tfidf.main.check_positive_int(user_input: Any) → bool

Check if the object is a positive integer (not bool).

Parameters:

user_input (Any) – Object to check

Returns:

True if valid, False otherwise

Return type:

bool

lab_1_keywords_tfidf.main.clean_and_tokenize(text: str) → list[str] | None

Remove punctuation, convert to lowercase, and split into tokens.

Parameters:

text (str) – Original text

Returns:

A list of lowercase tokens without punctuation. In case of corrupt input arguments, None is returned.

Return type:

list[str] | None
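A minimal sketch, assuming "punctuation" means any character that is neither alphanumeric nor whitespace:

```python
def clean_and_tokenize(text):
    """Lowercase the text, drop punctuation, split on whitespace."""
    if not isinstance(text, str):
        return None
    cleaned = "".join(ch for ch in text.lower()
                      if ch.isalnum() or ch.isspace())
    return cleaned.split()
```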

lab_1_keywords_tfidf.main.extract_significant_words(chi_values: dict[str, float], alpha: float) → dict[str, float] | None

Select tokens with chi-squared values greater than the critical threshold.

Parameters:
  • chi_values (dict[str, float]) – Dictionary with chi-squared values

  • alpha (float) – Significance level controlling chi-squared threshold

Returns:

Dictionary with significant tokens. In case of corrupt input arguments, None is returned.

Return type:

dict[str, float] | None
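A possible sketch, assuming one degree of freedom and a small built-in table of chi-squared critical values; the set of supported alpha levels is an assumption:

```python
def extract_significant_words(chi_values, alpha):
    """Keep tokens whose chi-squared value exceeds the critical threshold."""
    # Critical values for df = 1; supported alpha levels are an assumption
    criterion = {0.05: 3.842, 0.01: 6.635, 0.001: 10.828}
    if not isinstance(chi_values, dict) or not chi_values:
        return None
    if alpha not in criterion:
        return None
    threshold = criterion[alpha]
    return {token: value for token, value in chi_values.items()
            if value > threshold}
```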

lab_1_keywords_tfidf.main.get_top_n(frequencies: dict[str, int | float], top: int) → list[str] | None

Extract the most frequent tokens.

Parameters:
  • frequencies (dict[str, int | float]) – A dictionary with tokens and their frequencies

  • top (int) – Number of tokens to extract

Returns:

Top-N tokens sorted by frequency. In case of corrupt input arguments, None is returned.

Return type:

list[str] | None
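A minimal sketch of the selection step; the tie-breaking order among equal frequencies is an assumption (Python's stable sort keeps insertion order):

```python
def get_top_n(frequencies, top):
    """Return the `top` tokens with the highest frequency, descending."""
    if not isinstance(frequencies, dict) or not frequencies:
        return None
    # A positive int is required; bool is excluded (it subclasses int)
    if not isinstance(top, int) or isinstance(top, bool) or top <= 0:
        return None
    return sorted(frequencies, key=lambda token: frequencies[token],
                  reverse=True)[:top]
```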

lab_1_keywords_tfidf.main.remove_stop_words(tokens: list[str], stop_words: list[str]) → list[str] | None

Exclude stop words from the token sequence.

Parameters:
  • tokens (list[str]) – Original token sequence

  • stop_words (list[str]) – Tokens to exclude

Returns:

Token sequence without stop words. In case of corrupt input arguments, None is returned.

Return type:

list[str] | None
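A minimal sketch of the filtering step; converting the stop-word list to a set for O(1) lookups is a design choice, and the validation rules are an assumption:

```python
def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word list."""
    if not isinstance(tokens, list) or not tokens:
        return None
    if not isinstance(stop_words, list):
        return None
    stop_set = set(stop_words)
    return [token for token in tokens if token not in stop_set]
```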