lab_2_retrieval_w_bm25 package
Submodules
Lab 2.
Text retrieval with BM25
- lab_2_retrieval_w_bm25.main.build_vocabulary(documents: list[list[str]]) → list[str] | None
Build a vocabulary from the documents.
- Parameters:
documents (list[list[str]]) – Tokenized documents to build the vocabulary from.
- Returns:
List with unique words from the documents.
- Return type:
list[str] | None
In case of corrupt input arguments, None is returned.
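As a rough illustration of the contract described above, the following sketch collects unique tokens and returns None on malformed input. The name `sketch_build_vocabulary` is hypothetical, and sorting the result is an assumption made only to keep the output reproducible; the lab's own ordering is not documented here.

```python
def sketch_build_vocabulary(documents):
    """Hypothetical stand-in for build_vocabulary(); not the lab's actual code."""
    if not isinstance(documents, list) or not documents:
        return None  # mirrors the documented None-on-bad-input contract
    vocab = set()
    for doc in documents:
        if not isinstance(doc, list) or not all(isinstance(t, str) for t in doc):
            return None
        vocab.update(doc)
    # Assumption: sorted order, so that repeated runs give the same list.
    return sorted(vocab)


vocab = sketch_build_vocabulary([["cat", "sat"], ["cat", "ran"]])
```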
- lab_2_retrieval_w_bm25.main.calculate_bm25(vocab: list[str], document: list[str], idf_document: dict[str, float], k1: float = 1.5, b: float = 0.75, avg_doc_len: float | None = None, doc_len: int | None = None) → dict[str, float] | None
Calculate BM25 scores for a document.
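- Parameters:
vocab (list[str]) – Vocabulary terms.
document (list[str]) – Tokens of the document.
idf_document (dict[str, float]) – IDF score for each term.
k1 (float) – BM25 term-frequency saturation parameter; defaults to 1.5.
b (float) – BM25 length-normalization parameter; defaults to 0.75.
avg_doc_len (float | None) – Average document length in the collection.
doc_len (int | None) – Length of the document.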
- Returns:
Mapping from terms to their BM25 scores.
- Return type:
dict[str, float] | None
In case of corrupt input arguments, None is returned.
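The sketch below shows how the parameters in this signature fit the standard Okapi BM25 term score, idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len)). It assumes tf is the raw count of the term in the document and that terms missing from `idf_document` score 0; `sketch_bm25` is a hypothetical name, and the lab's implementation may differ in these details.

```python
def sketch_bm25(vocab, document, idf_document, k1=1.5, b=0.75,
                avg_doc_len=None, doc_len=None):
    """Hypothetical stand-in for calculate_bm25(); not the lab's actual code."""
    if not vocab or not document or not avg_doc_len or doc_len is None:
        return None  # mirrors the documented None-on-bad-input contract
    # Length normalization: longer-than-average documents are penalized.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    scores = {}
    for term in vocab:
        count = document.count(term)  # assumption: raw term count
        idf = idf_document.get(term, 0.0)  # assumption: unknown terms score 0
        scores[term] = idf * count * (k1 + 1) / (count + k1 * length_norm)
    return scores


scores = sketch_bm25(
    vocab=["bird", "cat", "dog"],
    document=["cat", "cat", "dog"],
    idf_document={"bird": 3.0, "cat": 1.0, "dog": 2.0},
    avg_doc_len=3.0,
    doc_len=3,
)
```

With k1 = 1.5 and a document of exactly average length, a term that occurs once keeps its full IDF weight, while repeated occurrences add progressively less.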
- lab_2_retrieval_w_bm25.main.calculate_bm25_with_cutoff(vocab: list[str], document: list[str], idf_document: dict[str, float], alpha: float, k1: float = 1.5, b: float = 0.75, avg_doc_len: float | None = None, doc_len: int | None = None) → dict[str, float] | None
Calculate BM25 scores for a document with IDF cutoff.
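- Parameters:
vocab (list[str]) – Vocabulary terms.
document (list[str]) – Tokens of the document.
idf_document (dict[str, float]) – IDF score for each term.
alpha (float) – IDF cutoff threshold.
k1 (float) – BM25 term-frequency saturation parameter; defaults to 1.5.
b (float) – BM25 length-normalization parameter; defaults to 0.75.
avg_doc_len (float | None) – Average document length in the collection.
doc_len (int | None) – Length of the document.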
- Returns:
Mapping from terms to their BM25 scores with cutoff applied.
- Return type:
dict[str, float] | None
In case of corrupt input arguments, None is returned.
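One plausible reading of the cutoff, sketched below, is that terms whose IDF falls below `alpha` are left out of the result entirely. That behaviour is an assumption (the lab might instead assign such terms a score of 0), and `sketch_bm25_with_cutoff` is a hypothetical name. The scoring formula is the standard Okapi BM25 term score with raw term counts.

```python
def sketch_bm25_with_cutoff(vocab, document, idf_document, alpha,
                            k1=1.5, b=0.75, avg_doc_len=None, doc_len=None):
    """Hypothetical stand-in for calculate_bm25_with_cutoff()."""
    if not vocab or not document or not avg_doc_len or doc_len is None:
        return None  # mirrors the documented None-on-bad-input contract
    length_norm = 1 - b + b * doc_len / avg_doc_len
    scores = {}
    for term in vocab:
        idf = idf_document.get(term, 0.0)
        if idf < alpha:
            continue  # assumption: low-IDF terms are omitted, not zeroed
        count = document.count(term)
        scores[term] = idf * count * (k1 + 1) / (count + k1 * length_norm)
    return scores


scores = sketch_bm25_with_cutoff(
    vocab=["cat", "dog"],
    document=["cat", "dog"],
    idf_document={"cat": 0.1, "dog": 2.0},
    alpha=0.5,
    avg_doc_len=2.0,
    doc_len=2,
)
```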
- lab_2_retrieval_w_bm25.main.calculate_idf(vocab: list[str], documents: list[list[str]]) → dict[str, float] | None
Calculate inverse document frequency for each term in the vocabulary.
- Parameters:
vocab (list[str]) – Vocabulary terms.
documents (list[list[str]]) – Tokenized documents.
- Returns:
Mapping from vocabulary terms to their IDF scores.
- Return type:
dict[str, float] | None
In case of corrupt input arguments, None is returned.
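The exact IDF formula is not stated in this reference. The sketch below assumes the smoothed variant commonly paired with BM25, log((N - n + 0.5) / (n + 0.5)), where N is the number of documents and n the number of documents containing the term; `sketch_idf` is a hypothetical name. Under this variant, a term that appears in more than half of the documents gets a negative score.

```python
import math


def sketch_idf(vocab, documents):
    """Hypothetical stand-in for calculate_idf(); the formula is an assumption."""
    if not vocab or not documents:
        return None  # mirrors the documented None-on-bad-input contract
    n_docs = len(documents)
    idf = {}
    for term in vocab:
        containing = sum(1 for doc in documents if term in doc)
        idf[term] = math.log((n_docs - containing + 0.5) / (containing + 0.5))
    return idf


docs = [["cat", "sat"], ["dog", "sat"], ["bird", "sat"]]
idf = sketch_idf(["cat", "sat"], docs)
```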
- lab_2_retrieval_w_bm25.main.calculate_spearman(rank: list[int], golden_rank: list[int]) → float | None
Calculate Spearman’s rank correlation coefficient between two rankings.
- Parameters:
rank (list[int]) – Ranking to evaluate.
golden_rank (list[int]) – Reference ranking.
- Returns:
Spearman’s rank correlation coefficient.
- Return type:
float | None
In case of corrupt input arguments, None is returned.
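For rankings without ties, Spearman's coefficient reduces to 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the difference between the two ranks of item i. The sketch below applies that formula under the assumption that both lists hold rank positions for the same items in the same order; `sketch_spearman` is a hypothetical name.

```python
def sketch_spearman(rank, golden_rank):
    """Hypothetical stand-in for calculate_spearman(); assumes no tied ranks."""
    if not isinstance(rank, list) or not isinstance(golden_rank, list):
        return None  # mirrors the documented None-on-bad-input contract
    n = len(rank)
    if n != len(golden_rank) or n < 2:
        return None  # the formula is undefined for fewer than two items
    squared_diffs = sum((a - b) ** 2 for a, b in zip(rank, golden_rank))
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


same = sketch_spearman([1, 2, 3], [1, 2, 3])
reversed_order = sketch_spearman([1, 2, 3], [3, 2, 1])
```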
- lab_2_retrieval_w_bm25.main.calculate_tf(vocab: list[str], document_tokens: list[str]) → dict[str, float] | None
Calculate term frequency for the given tokens based on the vocabulary.
- Parameters:
vocab (list[str]) – Vocabulary terms.
document_tokens (list[str]) – Tokens of the document.
- Returns:
Mapping from vocabulary terms to their term frequency.
- Return type:
dict[str, float] | None
In case of corrupt input arguments, None is returned.
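A minimal sketch of term frequency, assuming the usual relative-frequency definition (occurrences of the term divided by the document length), which would match the float values in the return type. `sketch_tf` is a hypothetical name, and the lab may normalize differently.

```python
def sketch_tf(vocab, document_tokens):
    """Hypothetical stand-in for calculate_tf(); not the lab's actual code."""
    if not vocab or not document_tokens:
        return None  # mirrors the documented None-on-bad-input contract
    total = len(document_tokens)
    # Assumption: vocabulary terms absent from the document get 0.0.
    return {term: document_tokens.count(term) / total for term in vocab}


tf = sketch_tf(["cat", "dog", "bird"], ["cat", "cat", "dog", "sat"])
```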
- lab_2_retrieval_w_bm25.main.calculate_tf_idf(tf: dict[str, float], idf: dict[str, float]) → dict[str, float] | None
Calculate TF-IDF scores for a document.
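- Parameters:
tf (dict[str, float]) – Term frequencies for the document.
idf (dict[str, float]) – IDF scores for the terms.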
- Returns:
Mapping from terms to their TF-IDF scores.
- Return type:
dict[str, float] | None
In case of corrupt input arguments, None is returned.
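TF-IDF is the product of a term's frequency and its inverse document frequency. The sketch below computes it per term; `sketch_tf_idf` is a hypothetical name, and treating a term with no IDF entry as scoring 0 is an assumption.

```python
def sketch_tf_idf(tf, idf):
    """Hypothetical stand-in for calculate_tf_idf(); not the lab's actual code."""
    if not isinstance(tf, dict) or not isinstance(idf, dict) or not tf or not idf:
        return None  # mirrors the documented None-on-bad-input contract
    # Assumption: a term without an IDF entry contributes a score of 0.
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}


tf_idf = sketch_tf_idf({"cat": 0.5, "dog": 0.25}, {"cat": 2.0, "dog": 4.0})
```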
- lab_2_retrieval_w_bm25.main.load_index(file_path: str) → list[dict[str, float]] | None
Load the index from a file.
- Parameters:
file_path (str) – The path to the file from which to load the index.
- Returns:
The loaded index.
- Return type:
list[dict[str, float]] | None
In case of corrupt input arguments, None is returned.
- lab_2_retrieval_w_bm25.main.rank_documents(indexes: list[dict[str, float]], query: str, stopwords: list[str]) → list[tuple[int, float]] | None
Rank documents for the given query.
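- Parameters:
indexes (list[dict[str, float]]) – Per-document mappings from terms to scores.
query (str) – Query text.
stopwords (list[str]) – List of stopwords.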
- Returns:
Tuples of document index and its score in the ranking.
- Return type:
list[tuple[int, float]] | None
In case of corrupt input arguments, None is returned.
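One way to read this signature: the query is tokenized, stopwords are dropped, each document's score is the sum of its index values for the remaining query terms, and documents are returned in descending score order. The sketch below follows that reading. The summation rule, the whitespace tokenization, and the name `sketch_rank_documents` are all assumptions.

```python
def sketch_rank_documents(indexes, query, stopwords):
    """Hypothetical stand-in for rank_documents(); not the lab's actual code."""
    if (not isinstance(indexes, list) or not isinstance(query, str)
            or not isinstance(stopwords, list)):
        return None  # mirrors the documented None-on-bad-input contract
    blocked = set(stopwords)
    # Assumption: simple lowercase whitespace split stands in for tokenize().
    query_terms = [word for word in query.lower().split() if word not in blocked]
    scored = [
        (doc_id, sum(index.get(term, 0.0) for term in query_terms))
        for doc_id, index in enumerate(indexes)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = sketch_rank_documents(
    indexes=[{"cat": 0.2}, {"cat": 1.0, "dog": 0.5}],
    query="the cat",
    stopwords=["the"],
)
```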
- lab_2_retrieval_w_bm25.main.remove_stopwords(tokens: list[str], stopwords: list[str]) → list[str] | None
Remove stopwords from the list of tokens.
- Parameters:
tokens (list[str]) – Tokens to filter.
stopwords (list[str]) – Stopwords to remove.
- Returns:
Tokens after removing stopwords.
- Return type:
list[str] | None
In case of corrupt input arguments, None is returned.
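Stopword removal can be sketched as a filter that preserves token order. `sketch_remove_stopwords` is a hypothetical name, and converting the stopword list to a set is a design choice for faster lookups, not something this reference specifies.

```python
def sketch_remove_stopwords(tokens, stopwords):
    """Hypothetical stand-in for remove_stopwords(); not the lab's actual code."""
    if not isinstance(tokens, list) or not isinstance(stopwords, list):
        return None  # mirrors the documented None-on-bad-input contract
    blocked = set(stopwords)  # set membership tests are O(1) on average
    return [token for token in tokens if token not in blocked]


filtered = sketch_remove_stopwords(["the", "cat", "sat"], ["the", "a"])
```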
- lab_2_retrieval_w_bm25.main.save_index(index: list[dict[str, float]], file_path: str) → None
Save the index to a file.
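The on-disk format used by save_index and load_index is not described in this reference. The sketch below assumes plain JSON, which can represent a list of string-to-float mappings directly, and shows a save/load round trip. The names `sketch_save_index` and `sketch_load_index` are hypothetical.

```python
import json
import os
import tempfile


def sketch_save_index(index, file_path):
    """Hypothetical stand-in for save_index(); the JSON format is an assumption."""
    with open(file_path, "w", encoding="utf-8") as file:
        json.dump(index, file)


def sketch_load_index(file_path):
    """Hypothetical stand-in for load_index(); returns None if the file is missing."""
    if not isinstance(file_path, str) or not os.path.isfile(file_path):
        return None
    with open(file_path, encoding="utf-8") as file:
        return json.load(file)


index = [{"cat": 1.0, "dog": 0.5}, {"cat": 0.2}]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.json")
    sketch_save_index(index, path)
    restored = sketch_load_index(path)
    missing = sketch_load_index(os.path.join(tmp_dir, "absent.json"))
```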
- lab_2_retrieval_w_bm25.main.tokenize(text: str) → list[str] | None
Tokenize the input text into lowercase words without punctuation, digits and other symbols.
- Parameters:
text (str) – The input text to tokenize.
- Returns:
A list of words from the text.
- Return type:
list[str] | None
In case of corrupt input arguments, None is returned.
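A minimal sketch of the tokenization described above: lowercase the text, keep only letters and whitespace, then split. Whether non-letter characters are dropped or replaced with spaces is not specified here; this sketch assumes they are dropped, so "don't" would become "dont". `sketch_tokenize` is a hypothetical name.

```python
def sketch_tokenize(text):
    """Hypothetical stand-in for tokenize(); not the lab's actual code."""
    if not isinstance(text, str):
        return None  # mirrors the documented None-on-bad-input contract
    # Assumption: punctuation, digits and symbols are dropped, not spaced out.
    cleaned = "".join(ch for ch in text.lower() if ch.isalpha() or ch.isspace())
    return cleaned.split()


tokens = sketch_tokenize("Hello, World! 42 times")
```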