lab_2_retrieval_w_bm25 package

Submodules

Lab 2.

Text retrieval with BM25

lab_2_retrieval_w_bm25.main.build_vocabulary(documents: list[list[str]]) → list[str] | None

Build a vocabulary from the documents.

Parameters:

documents (list[list[str]]) – List of tokenized documents.

Returns:

List with unique words from the documents.

Return type:

list[str] | None

In case of corrupt input arguments, None is returned.
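
A minimal sketch of the behaviour described above. The name build_vocabulary_sketch is hypothetical, and both the validation rules and the first-seen ordering of the result are assumptions; the reference only promises a list of unique words.

```python
def build_vocabulary_sketch(documents):
    """Hypothetical stand-in for build_vocabulary, not the lab's own code."""
    # Assumed meaning of "corrupt input": anything but a non-empty list
    # of lists of strings.
    if not isinstance(documents, list) or not documents:
        return None
    vocab = []
    seen = set()
    for doc in documents:
        if not isinstance(doc, list) or not all(isinstance(t, str) for t in doc):
            return None
        for token in doc:
            # Keep only the first occurrence of each token.
            if token not in seen:
                seen.add(token)
                vocab.append(token)
    return vocab


vocab = build_vocabulary_sketch([["cat", "sat"], ["cat", "ran"]])
bad = build_vocabulary_sketch("not a list")
```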

lab_2_retrieval_w_bm25.main.calculate_bm25(vocab: list[str], document: list[str], idf_document: dict[str, float], k1: float = 1.5, b: float = 0.75, avg_doc_len: float | None = None, doc_len: int | None = None) → dict[str, float] | None

Calculate BM25 scores for a document.

Parameters:
  • vocab (list[str]) – Vocabulary list.

  • document (list[str]) – Tokenized document.

  • idf_document (dict[str, float]) – Inverse document frequencies.

  • k1 (float) – BM25 parameter.

  • b (float) – BM25 parameter.

  • avg_doc_len (float | None) – Average document length.

  • doc_len (int | None) – Length of the document.

Returns:

Mapping from terms to their BM25 scores.

Return type:

dict[str, float] | None

In case of corrupt input arguments, None is returned.
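
The sketch below shows how the parameters might combine, assuming the standard Okapi BM25 term score, idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len)), with tf taken as the raw term count. The function name, the validation rules, and the choice of formula are assumptions; the lab's implementation may differ.

```python
def calculate_bm25_sketch(vocab, document, idf_document, k1=1.5, b=0.75,
                          avg_doc_len=None, doc_len=None):
    """Hypothetical stand-in for calculate_bm25 using the Okapi formula."""
    if not vocab or not document or not idf_document:
        return None
    if avg_doc_len is None or doc_len is None or avg_doc_len <= 0:
        return None
    # Length normalisation: longer-than-average documents are penalised.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    scores = {}
    for term in vocab:
        freq = document.count(term)           # raw term count (assumption)
        idf = idf_document.get(term, 0.0)     # missing IDF treated as 0 (assumption)
        scores[term] = idf * freq * (k1 + 1) / (freq + norm)
    return scores


bm25 = calculate_bm25_sketch(
    ["cat", "dog"],
    ["cat", "sat", "cat", "ran"],
    {"cat": 1.0, "dog": 2.0},
    avg_doc_len=4.0,
    doc_len=4,
)
```

With doc_len equal to avg_doc_len the normalisation term reduces to k1, so "cat" (two occurrences) scores 5 / 3.5 and the absent "dog" scores 0.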

lab_2_retrieval_w_bm25.main.calculate_bm25_with_cutoff(vocab: list[str], document: list[str], idf_document: dict[str, float], alpha: float, k1: float = 1.5, b: float = 0.75, avg_doc_len: float | None = None, doc_len: int | None = None) → dict[str, float] | None

Calculate BM25 scores for a document with IDF cutoff.

Parameters:
  • vocab (list[str]) – Vocabulary list.

  • document (list[str]) – Tokenized document.

  • idf_document (dict[str, float]) – Inverse document frequencies.

  • alpha (float) – IDF cutoff threshold.

  • k1 (float) – BM25 parameter.

  • b (float) – BM25 parameter.

  • avg_doc_len (float | None) – Average document length.

  • doc_len (int | None) – Length of the document.

Returns:

Mapping from terms to their BM25 scores with cutoff applied.

Return type:

dict[str, float] | None

In case of corrupt input arguments, None is returned.
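
One plausible reading of the cutoff, sketched under stated assumptions: terms whose IDF falls below alpha are left out of the result, and the remaining terms are scored with the standard Okapi BM25 formula. Whether the lab omits such terms or assigns them a zero score is not stated in this reference, and the function name is hypothetical.

```python
def calculate_bm25_with_cutoff_sketch(vocab, document, idf_document, alpha,
                                      k1=1.5, b=0.75,
                                      avg_doc_len=None, doc_len=None):
    """Hypothetical stand-in for calculate_bm25_with_cutoff."""
    if not vocab or not document or not idf_document:
        return None
    if avg_doc_len is None or doc_len is None or avg_doc_len <= 0:
        return None
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    scores = {}
    for term in vocab:
        idf = idf_document.get(term)
        # Assumed cutoff rule: skip terms with no IDF or with IDF below alpha.
        if idf is None or idf < alpha:
            continue
        freq = document.count(term)
        scores[term] = idf * freq * (k1 + 1) / (freq + norm)
    return scores


cut = calculate_bm25_with_cutoff_sketch(
    ["cat", "the"],
    ["the", "cat", "the", "cat"],
    {"cat": 1.0, "the": -1.9},
    alpha=0.0,
    avg_doc_len=4.0,
    doc_len=4,
)
```

Here the very common term "the" has a negative IDF, so with alpha = 0.0 it is dropped and only "cat" is scored.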

lab_2_retrieval_w_bm25.main.calculate_idf(vocab: list[str], documents: list[list[str]]) → dict[str, float] | None

Calculate inverse document frequency for each term in the vocabulary.

Parameters:
  • vocab (list[str]) – Vocabulary list.

  • documents (list[list[str]]) – List of tokenized documents.

Returns:

Mapping from vocabulary terms to their IDF scores.

Return type:

dict[str, float] | None

In case of corrupt input arguments, None is returned.
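
A sketch of one common IDF definition, the smoothed form usually paired with BM25: log((N - n + 0.5) / (n + 0.5)), where N is the number of documents and n is the number of documents containing the term. The reference does not say which formula the lab uses, so the formula, the function name, and the validation are assumptions.

```python
import math


def calculate_idf_sketch(vocab, documents):
    """Hypothetical stand-in for calculate_idf with a smoothed IDF."""
    if not vocab or not documents:
        return None
    n_docs = len(documents)
    idf = {}
    for term in vocab:
        # Number of documents that contain the term at least once.
        n_with_term = sum(1 for doc in documents if term in doc)
        idf[term] = math.log((n_docs - n_with_term + 0.5) / (n_with_term + 0.5))
    return idf


documents = [["the", "cat"], ["the", "dog"], ["the", "bird"]]
idf = calculate_idf_sketch(["the", "cat"], documents)
```

Under this formula a term that occurs in more than half of the documents, such as "the" here, gets a negative IDF.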

lab_2_retrieval_w_bm25.main.calculate_spearman(rank: list[int], golden_rank: list[int]) → float | None

Calculate Spearman’s rank correlation coefficient between two rankings.

Parameters:
  • rank (list[int]) – Ranked list of document indices.

  • golden_rank (list[int]) – Golden ranked list of document indices.

Returns:

Spearman’s rank correlation coefficient.

Return type:

float | None

In case of corrupt input arguments, None is returned.
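
A sketch of the textbook coefficient for rankings without ties, 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference between a document's position in the two lists. Treating each list as an ordering of the same document indices is an assumption, as are the function name and the validation rules.

```python
def calculate_spearman_sketch(rank, golden_rank):
    """Hypothetical stand-in for calculate_spearman (no ties assumed)."""
    if not rank or not golden_rank or len(rank) != len(golden_rank):
        return None
    # Both lists must be permutations of the same document indices.
    if len(set(rank)) != len(rank) or set(rank) != set(golden_rank):
        return None
    n = len(rank)
    if n < 2:
        return None
    golden_pos = {doc: pos for pos, doc in enumerate(golden_rank)}
    # Sum of squared position differences per document.
    d_squared = sum((pos - golden_pos[doc]) ** 2 for pos, doc in enumerate(rank))
    return 1 - 6 * d_squared / (n * (n * n - 1))


same = calculate_spearman_sketch([0, 1, 2], [0, 1, 2])
reverse = calculate_spearman_sketch([2, 1, 0], [0, 1, 2])
```

Identical rankings give 1.0 and fully reversed rankings give -1.0.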

lab_2_retrieval_w_bm25.main.calculate_tf(vocab: list[str], document_tokens: list[str]) → dict[str, float] | None

Calculate term frequency for the given tokens based on the vocabulary.

Parameters:
  • vocab (list[str]) – Vocabulary list.

  • document_tokens (list[str]) – Tokenized document.

Returns:

Mapping from vocabulary terms to their term frequency.

Return type:

dict[str, float] | None

In case of corrupt input arguments, None is returned.
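
A sketch assuming term frequency means the term's count divided by the document length, which matches the float values in the return type. That definition, the function name, and the validation are assumptions.

```python
from collections import Counter


def calculate_tf_sketch(vocab, document_tokens):
    """Hypothetical stand-in for calculate_tf using relative frequency."""
    if not isinstance(vocab, list) or not isinstance(document_tokens, list):
        return None
    if not vocab or not document_tokens:
        return None
    counts = Counter(document_tokens)
    total = len(document_tokens)
    # Vocabulary terms absent from the document get a frequency of 0.0.
    return {term: counts[term] / total for term in vocab}


tf = calculate_tf_sketch(["cat", "sat", "dog"], ["cat", "sat", "cat", "ran"])
```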

lab_2_retrieval_w_bm25.main.calculate_tf_idf(tf: dict[str, float], idf: dict[str, float]) → dict[str, float] | None

Calculate TF-IDF scores for a document.

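Parameters:
  • tf (dict[str, float]) – Term frequencies.

  • idf (dict[str, float]) – Inverse document frequencies.

Returns: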

Mapping from terms to their TF-IDF scores.

Return type:

dict[str, float] | None

In case of corrupt input arguments, None is returned.
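
A sketch of the usual TF-IDF product, score = tf * idf per term. Skipping terms that have no IDF entry is an assumption, and the function name is hypothetical.

```python
def calculate_tf_idf_sketch(tf, idf):
    """Hypothetical stand-in for calculate_tf_idf."""
    if not isinstance(tf, dict) or not isinstance(idf, dict):
        return None
    if not tf or not idf:
        return None
    # Multiply each term frequency by the matching IDF value.
    return {term: tf[term] * idf[term] for term in tf if term in idf}


tf_idf = calculate_tf_idf_sketch({"cat": 0.5, "sat": 0.25}, {"cat": 2.0, "sat": 4.0})
```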

lab_2_retrieval_w_bm25.main.load_index(file_path: str) → list[dict[str, float]] | None

Load the index from a file.

Parameters:

file_path (str) – The path to the file from which to load the index.

Returns:

The loaded index.

Return type:

list[dict[str, float]] | None

In case of corrupt input arguments, None is returned.

lab_2_retrieval_w_bm25.main.rank_documents(indexes: list[dict[str, float]], query: str, stopwords: list[str]) → list[tuple[int, float]] | None

Rank documents for the given query.

Parameters:
  • indexes (list[dict[str, float]]) – List of BM25 or TF-IDF indexes for the documents.

  • query (str) – The query string.

  • stopwords (list[str]) – List of stopwords.

Returns:

Tuples of document index and its score in the ranking.

Return type:

list[tuple[int, float]] | None

In case of corrupt input arguments, None is returned.
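
A sketch of how ranking could work: each document's score is the sum of its index values for the query terms, and documents are sorted by descending score. The whitespace tokenization used here is a simplification, and the scoring rule, function name, and validation are assumptions.

```python
def rank_documents_sketch(indexes, query, stopwords):
    """Hypothetical stand-in for rank_documents."""
    if not isinstance(indexes, list) or not indexes:
        return None
    if not isinstance(query, str) or not isinstance(stopwords, list):
        return None
    # Simplified query processing (assumption): lowercase, split on
    # whitespace, drop stopwords.
    terms = [t for t in query.lower().split() if t not in stopwords]
    if not terms:
        return None
    scored = [
        (doc_id, sum(index.get(term, 0.0) for term in terms))
        for doc_id, index in enumerate(indexes)
    ]
    # Highest score first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_documents_sketch(
    [{"cat": 0.2, "dog": 0.0}, {"cat": 0.9, "dog": 0.5}],
    "the cat",
    ["the"],
)
```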

lab_2_retrieval_w_bm25.main.remove_stopwords(tokens: list[str], stopwords: list[str]) → list[str] | None

Remove stopwords from the list of tokens.

Parameters:
  • tokens (list[str]) – List of tokens.

  • stopwords (list[str]) – List of stopwords.

Returns:

Tokens after removing stopwords.

Return type:

list[str] | None

In case of corrupt input arguments, None is returned.
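
A minimal sketch of stopword filtering that preserves token order. The function name and the type checks are assumptions.

```python
def remove_stopwords_sketch(tokens, stopwords):
    """Hypothetical stand-in for remove_stopwords."""
    if not isinstance(tokens, list) or not isinstance(stopwords, list):
        return None
    # A set makes each membership test constant time.
    stop_set = set(stopwords)
    return [token for token in tokens if token not in stop_set]


filtered = remove_stopwords_sketch(["the", "cat", "sat", "on", "the", "mat"], ["the", "on"])
```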

lab_2_retrieval_w_bm25.main.save_index(index: list[dict[str, float]], file_path: str) → None

Save the index to a file.

Parameters:
  • index (list[dict[str, float]]) – The index to save.

  • file_path (str) – The path to the file where the index will be saved.
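
A round-trip sketch pairing save_index with load_index, assuming the index is stored as JSON. The file format is not stated in this reference, so JSON, both function names, and the validation are assumptions.

```python
import json
import os
import tempfile


def save_index_sketch(index, file_path):
    """Hypothetical stand-in for save_index; JSON storage is an assumption."""
    with open(file_path, "w", encoding="utf-8") as file:
        json.dump(index, file)


def load_index_sketch(file_path):
    """Hypothetical stand-in for load_index."""
    if not isinstance(file_path, str) or not file_path:
        return None
    with open(file_path, "r", encoding="utf-8") as file:
        return json.load(file)


index = [{"cat": 1.5}, {"dog": 0.25}]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.json")
    save_index_sketch(index, path)
    restored = load_index_sketch(path)
```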

lab_2_retrieval_w_bm25.main.tokenize(text: str) → list[str] | None

Tokenize the input text into lowercase words without punctuation, digits and other symbols.

Parameters:

text (str) – The input text to tokenize.

Returns:

A list of words from the text.

Return type:

list[str] | None

In case of corrupt input arguments, None is returned.
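
A sketch of one way to meet this description: every non-letter character is replaced with a space, then the text is lowercased and split. Whether the lab replaces punctuation with a space or deletes it outright (which changes how a word such as "don't" is split) is not stated, so that choice and the function name are assumptions.

```python
def tokenize_sketch(text):
    """Hypothetical stand-in for tokenize."""
    if not isinstance(text, str):
        return None
    # Turn punctuation, digits and other symbols into spaces.
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text)
    return cleaned.lower().split()


tokens = tokenize_sketch("Hello, World! 42 cats.")
```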