lab_2_retrieval_w_bm25 package

Submodules

Lab 2.

Text retrieval with BM25

lab_2_retrieval_w_bm25.main.build_vocabulary(documents: list[list[str]]) → list[str] | None

Build a vocabulary from the documents.

Parameters:

documents (list[list[str]]) – List of tokenized documents.

Returns:

List with unique words from the documents.

Return type:

list[str] | None

In case of corrupt input arguments, None is returned.
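
The following is a minimal sketch of how a vocabulary could be built, not the package's actual implementation. The name `build_vocabulary_sketch`, the first-seen ordering, and the exact definition of "corrupt input" (anything other than a non-empty list of lists of strings) are assumptions made for illustration.

```python
def build_vocabulary_sketch(documents):
    """Hypothetical sketch: collect unique tokens in first-seen order."""
    # Assumption: input must be a non-empty list of lists of strings.
    if not isinstance(documents, list) or not documents:
        return None
    vocab = []
    seen = set()
    for doc in documents:
        if not isinstance(doc, list) or not all(isinstance(t, str) for t in doc):
            return None
        for token in doc:
            if token not in seen:
                seen.add(token)
                vocab.append(token)
    return vocab
```

Keeping a separate `seen` set makes the membership check constant-time while the list preserves a stable order.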

lab_2_retrieval_w_bm25.main.calculate_bm25(vocab: list[str], document: list[str], idf_document: dict[str, float], k1: float = 1.5, b: float = 0.75, avg_doc_len: float | None = None, doc_len: int | None = None) → dict[str, float] | None

Calculate BM25 scores for a document.

Parameters:

vocab (list[str]) – Vocabulary list.
document (list[str]) – Tokenized document.
idf_document (dict[str, float]) – Inverse document frequencies.
k1 (float) – BM25 parameter.
b (float) – BM25 parameter.
avg_doc_len (float | None) – Average document length.
doc_len (int | None) – Length of the document.

Returns:

Mapping from terms to their BM25 scores.

Return type:

dict[str, float] | None

In case of corrupt input arguments, None is returned.
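
As an illustration of what the parameters control, here is a sketch based on the standard Okapi BM25 term formula, idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len)), where tf is the raw count of the term in the document. Whether the package uses exactly this variant is an assumption, as are the name `calculate_bm25_sketch`, the validation rules, and the zero score for terms missing from `idf_document`.

```python
def calculate_bm25_sketch(vocab, document, idf_document, k1=1.5, b=0.75,
                          avg_doc_len=None, doc_len=None):
    """Hypothetical sketch of per-term Okapi BM25 scoring for one document."""
    # Assumption: empty inputs and missing or non-positive lengths are corrupt.
    if not vocab or not document or not idf_document:
        return None
    if avg_doc_len is None or doc_len is None or avg_doc_len <= 0 or doc_len <= 0:
        return None
    # Length normalisation: documents longer than average are penalised.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    scores = {}
    for term in vocab:
        count = document.count(term)       # raw term count in this document
        idf = idf_document.get(term, 0.0)  # assumption: unknown terms score 0
        denominator = count + norm
        scores[term] = idf * count * (k1 + 1) / denominator if denominator else 0.0
    return scores
```

In this form `k1` caps how much repeated occurrences of a term can add, and `b` sets how strongly the score is normalised by document length.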

lab_2_retrieval_w_bm25.main.calculate_bm25_with_cutoff(vocab: list[str], document: list[str], idf_document: dict[str, float], alpha: float, k1: float = 1.5, b: float = 0.75, avg_doc_len: float | None = None, doc_len: int | None = None) → dict[str, float] | None

Calculate BM25 scores for a document with IDF cutoff.

Parameters:

vocab (list[str]) – Vocabulary list.
document (list[str]) – Tokenized document.
idf_document (dict[str, float]) – Inverse document frequencies.
alpha (float) – IDF cutoff threshold.
k1 (float) – BM25 parameter.
b (float) – BM25 parameter.
avg_doc_len (float | None) – Average document length.
doc_len (int | None) – Length of the document.

Returns:

Mapping from terms to their BM25 scores with cutoff applied.

Return type:

dict[str, float] | None

In case of corrupt input arguments, None is returned.
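
One plausible reading of the cutoff is sketched below: terms whose IDF falls below `alpha` receive a score of 0.0, and all other terms are scored with the standard Okapi BM25 formula. Both the treatment of cut-off terms (zeroed rather than dropped) and the formula variant are assumptions, and `calculate_bm25_with_cutoff_sketch` is a hypothetical name.

```python
def calculate_bm25_with_cutoff_sketch(vocab, document, idf_document, alpha,
                                      k1=1.5, b=0.75,
                                      avg_doc_len=None, doc_len=None):
    """Hypothetical sketch: BM25 where low-IDF terms are zeroed out."""
    # Assumption: empty inputs and missing or non-positive lengths are corrupt.
    if not vocab or not document or not idf_document:
        return None
    if avg_doc_len is None or doc_len is None or avg_doc_len <= 0 or doc_len <= 0:
        return None
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    scores = {}
    for term in vocab:
        idf = idf_document.get(term, 0.0)
        if idf < alpha:
            # Assumption: terms under the threshold score 0.0 instead of being removed.
            scores[term] = 0.0
            continue
        count = document.count(term)
        denominator = count + norm
        scores[term] = idf * count * (k1 + 1) / denominator if denominator else 0.0
    return scores
```

With `alpha` set to 0.0, this would suppress the negative contributions that very common terms produce under some IDF formulas.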

lab_2_retrieval_w_bm25.main.calculate_idf(vocab: list[str], documents: list[list[str]]) → dict[str, float] | None

Calculate inverse document frequency for each term in the vocabulary.

Parameters:

vocab (list[str]) – Vocabulary list.
documents (list[list[str]]) – List of tokenized documents.

Returns:

Mapping from vocabulary terms to their IDF scores.

Return type:

dict[str, float] | None

In case of corrupt input arguments, None is returned.
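
The documentation does not state which IDF formula is used. The sketch below assumes the smoothed variant often paired with BM25, ln((N - n + 0.5) / (n + 0.5)), where N is the number of documents and n is the number of documents containing the term. The name `calculate_idf_sketch` and the validation rule are hypothetical.

```python
import math


def calculate_idf_sketch(vocab, documents):
    """Hypothetical sketch: smoothed IDF for every vocabulary term."""
    # Assumption: empty vocabulary or empty collection counts as corrupt input.
    if not vocab or not documents:
        return None
    total = len(documents)
    idf = {}
    for term in vocab:
        # Number of documents in which the term occurs at least once.
        containing = sum(1 for doc in documents if term in doc)
        idf[term] = math.log((total - containing + 0.5) / (containing + 0.5))
    return idf
```

Under this formula a term that appears in more than half of the documents gets a negative IDF, which is one reason a cutoff threshold can be useful.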

lab_2_retrieval_w_bm25.main.calculate_spearman(rank: list[int], golden_rank: list[int]) → float | None

Calculate Spearman’s rank correlation coefficient between two rankings.

Parameters:

rank (list[int]) – Ranked list of document indices.
golden_rank (list[int]) – Golden ranked list of document indices.

Returns:

Spearman’s rank correlation coefficient.

Return type:

float | None

In case of corrupt input arguments, None is returned.
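
A sketch of the textbook formula, 1 - 6 * sum(d^2) / (n * (n^2 - 1)), is shown below. It assumes that each list holds document indices in ranked order, so d is the difference between a document's position in `rank` and its position in `golden_rank`. That interpretation, the validation rules, and the name `calculate_spearman_sketch` are assumptions.

```python
def calculate_spearman_sketch(rank, golden_rank):
    """Hypothetical sketch: Spearman correlation of two orderings without ties."""
    if not isinstance(rank, list) or not isinstance(golden_rank, list):
        return None
    n = len(rank)
    # Assumption: both lists must contain the same distinct documents, n >= 2.
    if n < 2 or n != len(golden_rank):
        return None
    if len(set(rank)) != n or set(rank) != set(golden_rank):
        return None
    golden_position = {doc: pos for pos, doc in enumerate(golden_rank)}
    # Sum of squared differences between positions in the two rankings.
    squared = sum((pos - golden_position[doc]) ** 2 for pos, doc in enumerate(rank))
    return 1 - 6 * squared / (n * (n ** 2 - 1))
```

Identical orderings give 1.0 and exactly reversed orderings give -1.0.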

lab_2_retrieval_w_bm25.main.calculate_tf(vocab: list[str], document_tokens: list[str]) → dict[str, float] | None

Calculate term frequency for the given tokens based on the vocabulary.

Parameters:

vocab (list[str]) – Vocabulary list.
document_tokens (list[str]) – Tokenized document.

Returns:

Mapping from vocabulary terms to their term frequency.

Return type:

dict[str, float] | None

In case of corrupt input arguments, None is returned.
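
For illustration, the sketch below assumes term frequency means the count of a term divided by the total number of tokens in the document, computed for vocabulary terms only. The package may define it differently, and `calculate_tf_sketch` is a hypothetical name.

```python
from collections import Counter


def calculate_tf_sketch(vocab, document_tokens):
    """Hypothetical sketch: relative frequency of each vocabulary term."""
    # Assumption: both arguments must be non-empty lists.
    if not isinstance(vocab, list) or not isinstance(document_tokens, list):
        return None
    if not vocab or not document_tokens:
        return None
    counts = Counter(document_tokens)
    total = len(document_tokens)
    # Terms absent from the document get a frequency of 0.0.
    return {term: counts[term] / total for term in vocab}
```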

lab_2_retrieval_w_bm25.main.calculate_tf_idf(tf: dict[str, float], idf: dict[str, float]) → dict[str, float] | None

Calculate TF-IDF scores for a document.

Parameters:

tf (dict[str, float]) – Term frequencies for the document.
idf (dict[str, float]) – Inverse document frequencies.

Returns:

Mapping from terms to their TF-IDF scores.

Return type:

dict[str, float] | None

In case of corrupt input arguments, None is returned.
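
The combination step can be sketched as a per-term product of the two mappings. Treating terms that are missing from `idf` as having an IDF of 0.0 is an assumption, as is the name `calculate_tf_idf_sketch`.

```python
def calculate_tf_idf_sketch(tf, idf):
    """Hypothetical sketch: multiply each term frequency by its IDF."""
    # Assumption: both arguments must be non-empty dictionaries.
    if not isinstance(tf, dict) or not isinstance(idf, dict) or not tf or not idf:
        return None
    # Assumption: a term with no IDF entry contributes a score of 0.0.
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}
```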

lab_2_retrieval_w_bm25.main.load_index(file_path: str) → list[dict[str, float]] | None

Load the index from a file.

Parameters:

file_path (str) – The path to the file from which to load the index.

Returns:

The loaded index.

Return type:

list[dict[str, float]] | None

In case of corrupt input arguments, None is returned.

lab_2_retrieval_w_bm25.main.rank_documents(indexes: list[dict[str, float]], query: str, stopwords: list[str]) → list[tuple[int, float]] | None

Rank documents for the given query.

Parameters:

indexes (list[dict[str, float]]) – List of BM25 or TF-IDF indexes for the documents.
query (str) – The query string.
stopwords (list[str]) – List of stopwords.

Returns:

Tuples pairing each document index with its score in the ranking.

Return type:

list[tuple[int, float]] | None

In case of corrupt input arguments, None is returned.
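
One way such ranking could work is sketched below: the query is tokenized, stopwords are dropped, each document's score is the sum of its index values for the remaining query terms, and the results are sorted from highest to lowest score. The inline tokenization, the summation rule, the descending order, and the name `rank_documents_sketch` are all assumptions.

```python
def rank_documents_sketch(indexes, query, stopwords):
    """Hypothetical sketch: score documents by summing index values of query terms."""
    if not isinstance(indexes, list) or not indexes:
        return None
    if not isinstance(query, str) or not isinstance(stopwords, list):
        return None
    # Assumption: lowercase the query and treat every non-letter as a separator.
    cleaned = "".join(ch if ch.isalpha() else " " for ch in query.lower())
    blocked = set(stopwords)
    terms = [token for token in cleaned.split() if token not in blocked]
    if not terms:
        return None
    scored = [
        (doc_id, sum(index.get(term, 0.0) for term in terms))
        for doc_id, index in enumerate(indexes)
    ]
    # Highest score first; ties keep their original document order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```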

lab_2_retrieval_w_bm25.main.remove_stopwords(tokens: list[str], stopwords: list[str]) → list[str] | None

Remove stopwords from the list of tokens.

Parameters:

tokens (list[str]) – List of tokens.
stopwords (list[str]) – List of stopwords.

Returns:

Tokens after removing stopwords.

Return type:

list[str] | None

In case of corrupt input arguments, None is returned.
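
A minimal sketch of the filtering step follows. The name `remove_stopwords_sketch` and the validation rule (both arguments must be lists) are assumptions.

```python
def remove_stopwords_sketch(tokens, stopwords):
    """Hypothetical sketch: keep tokens that are not in the stopword list."""
    # Assumption: anything other than two lists counts as corrupt input.
    if not isinstance(tokens, list) or not isinstance(stopwords, list):
        return None
    blocked = set(stopwords)  # set lookup avoids rescanning the list per token
    return [token for token in tokens if token not in blocked]
```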

lab_2_retrieval_w_bm25.main.save_index(index: list[dict[str, float]], file_path: str) → None

Save the index to a file.

Parameters:

index (list[dict[str, float]]) – The index to save.
file_path (str) – The path to the file where the index will be saved.
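
The on-disk format is not specified here. The sketch below assumes plain JSON and shows a save and load round trip through a temporary directory. The names `save_index_sketch`, `load_index_sketch`, and `roundtrip_demo` are hypothetical, and the real package may store the index differently.

```python
import json
import os
import tempfile


def save_index_sketch(index, file_path):
    """Hypothetical sketch: write the index as JSON (assumed format)."""
    with open(file_path, "w", encoding="utf-8") as handle:
        json.dump(index, handle, ensure_ascii=False, indent=2)


def load_index_sketch(file_path):
    """Hypothetical sketch: read an index previously written as JSON."""
    # Assumption: a non-string or empty path counts as corrupt input.
    if not isinstance(file_path, str) or not file_path:
        return None
    with open(file_path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def roundtrip_demo():
    """Save a small index to a temporary file and load it back."""
    index = [{"cat": 1.5, "dog": 0.0}, {"fish": 2.25}]
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "index.json")
        save_index_sketch(index, path)
        return load_index_sketch(path)
```

JSON keeps the list-of-dictionaries shape intact, so the loaded value compares equal to the saved one.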

lab_2_retrieval_w_bm25.main.tokenize(text: str) → list[str] | None

Tokenize the input text into lowercase words without punctuation, digits and other symbols.

Parameters:

text (str) – The input text to tokenize.

Returns:

A list of words from the text.

Return type:

list[str] | None

In case of corrupt input arguments, None is returned.
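
A sketch of one possible tokenization is given below. It assumes every character that is not a letter acts as a word separator; the package might instead delete such characters. The name `tokenize_sketch` is hypothetical.

```python
def tokenize_sketch(text):
    """Hypothetical sketch: lowercase the text and keep alphabetic words only."""
    # Assumption: any non-string argument counts as corrupt input.
    if not isinstance(text, str):
        return None
    # Replace punctuation, digits and other symbols with spaces, then split.
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    return cleaned.split()
```

Because `str.isalpha` is Unicode-aware, this sketch also keeps words written in non-Latin scripts.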