lab_4_retrieval_w_clustering package

Submodules

Lab 4.

Vector search with clustering

class lab_4_retrieval_w_clustering.main.BM25Vectorizer

Bases: Vectorizer

BM25 Vectorizer.

__init__() → None

Initialize an instance of the BM25Vectorizer class.

_avg_doc_len: float

_calculate_bm25(tokenized_document: list[str]) → tuple[float, ...]

Get BM25 vector for tokenized document.

Parameters:

tokenized_document (list[str]) – Tokenized document to vectorize.

Raises:

ValueError – If the input argument has an inappropriate type or is empty.

Returns:

BM25 vector for document.

Return type:

Vector
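The BM25 weighting itself is not spelled out in this reference. A minimal sketch, assuming the common Okapi BM25 formula, one weight per vocabulary term, and the conventional defaults k1 = 1.5 and b = 0.75 (the function name, the vocabulary argument, and both constants are assumptions, not taken from the package):

```python
import math


def bm25_vector(
    tokenized_document: list[str],
    corpus: list[list[str]],
    vocabulary: list[str],
    k1: float = 1.5,  # assumed default, not taken from the lab
    b: float = 0.75,  # assumed default, not taken from the lab
) -> tuple[float, ...]:
    """Return one BM25 weight per vocabulary term for the given document."""
    n_docs = len(corpus)
    avg_doc_len = sum(len(doc) for doc in corpus) / n_docs
    doc_len = len(tokenized_document)
    weights = []
    for term in vocabulary:
        tf = tokenized_document.count(term)
        docs_with_term = sum(1 for doc in corpus if term in doc)
        # Inverse document frequency in the classic Okapi form.
        idf = math.log((n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
        # Length normalisation relative to the average document length.
        norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
        weights.append(idf * tf * (k1 + 1) / norm)
    return tuple(weights)


corpus = [["cat", "sat"], ["dog", "ran"], ["bird", "flew"]]
vocabulary = sorted({token for doc in corpus for token in doc})
vector = bm25_vector(corpus[0], corpus, vocabulary)
```

Terms absent from the document get a weight of zero, so the resulting vector is sparse for a large vocabulary.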

_corpus: list[list[str]]

set_tokenized_corpus(tokenized_corpus: list[list[str]]) → None

Set tokenized corpus and average document length.

Parameters:

tokenized_corpus (TokenizedCorpus) – Tokenized corpus of texts.

Raises:

ValueError – If the input argument has an inappropriate type or is empty.

vectorize(tokenized_document: list[str]) → tuple[float, ...]

Create a vector for tokenized document.

Parameters:

tokenized_document (list[str]) – Tokenized document to vectorize.

Raises:

ValueError – If the input arguments have an inappropriate type or are empty, or if any of the methods used returns None.

Returns:

BM25 vector for document.

Return type:

Vector

class lab_4_retrieval_w_clustering.main.ClusterDTO(centroid_vector: tuple[float, ...])

Bases: object

Store a cluster: its centroid and the indices of its documents.

__centroid: tuple[float, ...]
__indices: list[int]

__init__(centroid_vector: tuple[float, ...]) → None

Initialize an instance of the ClusterDTO class.

Parameters:

centroid_vector (Vector) – Centroid vector.

__len__() → int

Return the number of document indices.

Returns:

The number of document indices.

Return type:

int

add_document_index(index: int) → None

Add document index.

Parameters:

index (int) – Index of document.

Raises:

ValueError – If the input argument has an inappropriate type or is empty.

erase_indices() → None

Clear the stored document indices.

get_centroid() → tuple[float, ...]

Get cluster centroid.

Returns:

Centroid of current cluster.

Return type:

Vector

get_indices() → list[int]

Get indices.

Returns:

Indices of documents.

Return type:

list[int]

set_new_centroid(new_centroid: tuple[float, ...]) → None

Set new centroid for cluster.

Parameters:

new_centroid (Vector) – New centroid vector.

Raises:

ValueError – If the input argument has an inappropriate type or is empty.
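The interface documented for ClusterDTO can be illustrated with a small stand-in class. This is a sketch only: the class name ClusterSketch, the validation rules, and the choice to ignore duplicate indices are assumptions and may differ from the real implementation.

```python
Vector = tuple[float, ...]


class ClusterSketch:
    """Hypothetical stand-in mirroring the documented ClusterDTO interface."""

    def __init__(self, centroid_vector: Vector) -> None:
        self._centroid = centroid_vector
        self._indices: list[int] = []

    def __len__(self) -> int:
        return len(self._indices)

    def add_document_index(self, index: int) -> None:
        # Assumption: only non-negative integers are valid indices.
        if not isinstance(index, int) or isinstance(index, bool) or index < 0:
            raise ValueError("index must be a non-negative integer")
        # Assumption: an index that is already present is not stored twice.
        if index not in self._indices:
            self._indices.append(index)

    def erase_indices(self) -> None:
        self._indices = []

    def get_centroid(self) -> Vector:
        return self._centroid

    def get_indices(self) -> list[int]:
        return self._indices

    def set_new_centroid(self, new_centroid: Vector) -> None:
        if not isinstance(new_centroid, tuple) or not new_centroid:
            raise ValueError("centroid must be a non-empty tuple")
        self._centroid = new_centroid


cluster = ClusterSketch((0.0, 0.0))
cluster.add_document_index(3)
cluster.add_document_index(7)
cluster.set_new_centroid((0.5, 1.5))
```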

class lab_4_retrieval_w_clustering.main.ClusteringSearchEngine(db: DocumentVectorDB, n_clusters: int = 3)

Bases: object

Search engine based on the KMeans algorithm.

__algo: KMeans

__init__(db: DocumentVectorDB, n_clusters: int = 3) → None

Initialize an instance of the ClusteringSearchEngine class.

Parameters:
  • db (DocumentVectorDB) – An instance of DocumentVectorDB class.

  • n_clusters (int) – Number of clusters.

_db: DocumentVectorDB

calculate_square_sum() → float

Get the sum, over all clusters, of the squared distances from each cluster's vectors to its centroid.

Returns:

Sum of squared distances from the vectors of each cluster to its centroid.

Return type:

float

make_report(num_examples: int, output_path: str) → None

Create a report on the clusters.

Parameters:
  • num_examples (int) – Number of examples for each cluster.

  • output_path (str) – Path to the output file.

retrieve_relevant_documents(query: str, n_neighbours: int) → list[tuple[float, str]]

Get relevant documents.

Parameters:
  • query (str) – Query for obtaining relevant documents.

  • n_neighbours (int) – Number of relevant documents to return.

Raises:

ValueError – If the input arguments have an inappropriate type, are empty, or are otherwise incorrect, or if any of the methods used returns None.

Returns:

Relevant documents with their distances.

Return type:

list[tuple[float, str]]

lab_4_retrieval_w_clustering.main.Corpus

Type alias for corpus of texts.

class lab_4_retrieval_w_clustering.main.DocumentVectorDB(stop_words: list[str])

Bases: object

Document and vector database.

__documents: list[str]

__init__(stop_words: list[str]) → None

Initialize an instance of the DocumentVectorDB class.

Parameters:

stop_words (list[str]) – List with stop words.

__vectors: dict[int, tuple[float, ...]]
_tokenizer: Tokenizer
_vectorizer: BM25Vectorizer

get_raw_documents(indices: tuple[int, ...] | None = None) → list[str]

Get documents by indices.

Parameters:

indices (tuple[int, ...] | None) – Document indices.

Raises:

ValueError – If the input argument has an inappropriate type.

Returns:

List of documents.

Return type:

Corpus

get_tokenizer() → Tokenizer

Get an object of the Tokenizer class.

Returns:

Tokenizer class object.

Return type:

Tokenizer

get_vectorizer() → BM25Vectorizer

Get an object of the BM25Vectorizer class.

Returns:

BM25Vectorizer class object.

Return type:

BM25Vectorizer

get_vectors(indices: list[int] | None = None) → list[tuple[int, tuple[float, ...]]]

Get document vectors by indices.

Parameters:

indices (list[int] | None) – Document indices.

Returns:

List of index and vector for documents.

Return type:

list[tuple[int, Vector]]

put_corpus(corpus: list[str]) → None

Fill documents and vectors based on corpus.

Parameters:

corpus (Corpus) – Corpus of texts.

Raises:

ValueError – If the input argument has an inappropriate type or is empty, or if any of the methods used returns None.

class lab_4_retrieval_w_clustering.main.KMeans(db: DocumentVectorDB, n_clusters: int)

Bases: object

Train k-means algorithm.

__clusters: list[ClusterDTO]

__init__(db: DocumentVectorDB, n_clusters: int) → None

Initialize an instance of the KMeans class.

Parameters:
  • db (DocumentVectorDB) – An instance of DocumentVectorDB class.

  • n_clusters (int) – Number of clusters.

_db: DocumentVectorDB

_is_convergence_reached(new_clusters: list[ClusterDTO], threshold: float = 1e-07) → bool

Check the convergence of centroids.

Parameters:
  • new_clusters (list[ClusterDTO]) – Clusters with updated centroids.

  • threshold (float) – Distance threshold used to decide whether the centroids have converged.

Raises:

ValueError – If the input arguments have an inappropriate type or are empty, or if any of the methods used returns None.

Returns:

True if convergence is reached, False otherwise.

Return type:

bool
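One plausible reading of this check is that every centroid must have moved by less than the threshold between two iterations. A minimal sketch under that assumption, working on plain centroid tuples rather than ClusterDTO objects and using Euclidean distance (both are illustrative choices, not confirmed by this reference):

```python
import math

Vector = tuple[float, ...]


def centroids_converged(
    old_centroids: list[Vector],
    new_centroids: list[Vector],
    threshold: float = 1e-07,
) -> bool:
    """Return True when no centroid moved farther than the threshold."""
    if not old_centroids or len(old_centroids) != len(new_centroids):
        raise ValueError("centroid lists must be non-empty and of equal length")
    for old, new in zip(old_centroids, new_centroids):
        # Euclidean distance between the previous and the updated centroid.
        if math.dist(old, new) > threshold:
            return False
    return True


stable = centroids_converged([(0.0, 1.0), (5.0, 5.0)], [(0.0, 1.0), (5.0, 5.0)])
moved = centroids_converged([(0.0, 1.0), (5.0, 5.0)], [(0.0, 1.0), (5.0, 5.5)])
```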

_n_clusters: int

calculate_square_sum() → float

Get the sum of squared distances from the vectors of each cluster to its centroid.

Returns:

Sum of squared distances from the vectors of each cluster to its centroid.

Return type:

float
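The quantity described here is the usual within-cluster sum of squares. A minimal sketch, assuming squared Euclidean distance and representing each cluster as a pair of centroid and member vectors (the function name and this data layout are hypothetical):

```python
Vector = tuple[float, ...]


def square_sum(clusters: list[tuple[Vector, list[Vector]]]) -> float:
    """Sum squared distances from every member vector to its own centroid."""
    total = 0.0
    for centroid, members in clusters:
        for vector in members:
            # Squared Euclidean distance, so no square root is taken.
            total += sum((v - c) ** 2 for v, c in zip(vector, centroid))
    return total


clusters = [
    ((0.0, 0.0), [(1.0, 0.0), (0.0, 2.0)]),
    ((1.0, 1.0), [(1.0, 1.0)]),
]
score = square_sum(clusters)
```

A lower value means tighter clusters, which makes the number useful for comparing runs with different values of n_clusters.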

get_clusters_info(num_examples: int) → list[dict[str, int | list[str]]]

Get clusters information.

Parameters:

num_examples (int) – Number of examples for each cluster.

Returns:

List with information about each cluster.

Return type:

list[dict[str, int | list[str]]]

infer(query_vector: tuple[float, ...], n_neighbours: int) → list[tuple[float, int]]

Launch clustering model inference.

Parameters:
  • query_vector (Vector) – Vector of query for obtaining relevant documents.

  • n_neighbours (int) – Number of relevant documents to return.

Raises:

ValueError – If the input arguments have an inappropriate type, are empty, or are otherwise incorrect, or if any of the methods used returns None.

Returns:

Distances to the relevant documents and the indices of those documents.

Return type:

list[tuple[float, int]]
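Inference in a clustered index is commonly done in two steps: pick the centroid closest to the query, then rank only the documents of that cluster. A minimal sketch under that assumption, with Euclidean distance and plain lists standing in for the database and the clusters (infer_sketch and its arguments are hypothetical):

```python
import math

Vector = tuple[float, ...]


def infer_sketch(
    query_vector: Vector,
    centroids: list[Vector],
    cluster_members: list[list[tuple[int, Vector]]],
    n_neighbours: int,
) -> list[tuple[float, int]]:
    """Return (distance, document index) pairs from the nearest cluster."""
    if not query_vector or n_neighbours <= 0:
        raise ValueError("query must be non-empty and n_neighbours positive")
    # Step 1: find the cluster whose centroid is closest to the query.
    nearest = min(
        range(len(centroids)),
        key=lambda i: math.dist(query_vector, centroids[i]),
    )
    # Step 2: rank only the documents that belong to that cluster.
    scored = [
        (math.dist(query_vector, vector), index)
        for index, vector in cluster_members[nearest]
    ]
    return sorted(scored)[:n_neighbours]


centroids = [(0.0, 0.0), (10.0, 10.0)]
cluster_members = [
    [(0, (0.0, 1.0)), (1, (1.0, 0.0)), (2, (0.0, 0.0))],
    [(3, (10.0, 9.0))],
]
neighbours = infer_sketch((0.0, 0.9), centroids, cluster_members, n_neighbours=2)
```

Searching a single cluster trades some recall for speed: a relevant document sitting just across a cluster boundary is never scored.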

run_single_train_iteration() → list[ClusterDTO]

Run single train iteration.

Raises:

ValueError – If any of the methods used returns None.

Returns:

List of clusters.

Return type:

list[ClusterDTO]
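A single k-means iteration usually consists of an assignment step followed by a centroid update. A minimal sketch, assuming Euclidean distance, mean-based centroids, and that an empty cluster keeps its previous centroid (the function name, the return shape, and the empty-cluster rule are assumptions; the order of the two steps in the lab may differ):

```python
import math

Vector = tuple[float, ...]


def single_iteration(
    vectors: list[Vector], centroids: list[Vector]
) -> tuple[list[Vector], list[list[int]]]:
    """Assign each vector to its nearest centroid, then recompute centroids."""
    assignments: list[list[int]] = [[] for _ in centroids]
    for index, vector in enumerate(vectors):
        nearest = min(
            range(len(centroids)),
            key=lambda i: math.dist(vector, centroids[i]),
        )
        assignments[nearest].append(index)

    new_centroids: list[Vector] = []
    for centroid, indices in zip(centroids, assignments):
        if not indices:
            # Assumption: a cluster without documents keeps its old centroid.
            new_centroids.append(centroid)
            continue
        members = [vectors[i] for i in indices]
        # The new centroid is the per-dimension mean of the member vectors.
        new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centroids, assignments


vectors = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
new_centroids, assignments = single_iteration(vectors, [(0.0, 0.0), (10.0, 10.0)])
```

Repeating this step until the centroids stop moving gives the full training loop.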

train() → None

Train k-means algorithm.

lab_4_retrieval_w_clustering.main.TokenizedCorpus

Type alias for tokenized texts.

class lab_4_retrieval_w_clustering.main.VectorDBAdvancedSearchEngine(db: DocumentVectorDB)

Bases: VectorDBEngine

Engine providing a unified interface to AdvancedSearchEngine.

__init__(db: DocumentVectorDB) → None

Initialize an instance of the VectorDBAdvancedSearchEngine class.

Parameters:

db (DocumentVectorDB) – An instance of DocumentVectorDB class.

class lab_4_retrieval_w_clustering.main.VectorDBEngine(db: DocumentVectorDB, engine: BasicSearchEngine)

Bases: object

Engine wrapper that encapsulates different engines and provides a unified API to them.

__init__(db: DocumentVectorDB, engine: BasicSearchEngine) → None

Initialize an instance of the VectorDBEngine class.

Parameters:
  • db (DocumentVectorDB) – An instance of DocumentVectorDB class.

  • engine (BasicSearchEngine) – Search engine to wrap.

_db: DocumentVectorDB
_engine: BasicSearchEngine

retrieve_relevant_documents(query: str, n_neighbours: int) → list[tuple[float, str]] | None

Get relevant documents.

Parameters:
  • query (str) – Query for obtaining relevant documents.

  • n_neighbours (int) – Number of relevant documents to return.

Returns:

Relevant documents with their distances.

Return type:

list[tuple[float, str]] | None

class lab_4_retrieval_w_clustering.main.VectorDBSearchEngine(db: DocumentVectorDB)

Bases: BasicSearchEngine

Engine based on VectorDB.

__init__(db: DocumentVectorDB) → None

Initialize an instance of the VectorDBSearchEngine class.

Parameters:

db (DocumentVectorDB) – Object of DocumentVectorDB class.

_db: DocumentVectorDB

retrieve_relevant_documents(query: str, n_neighbours: int) → list[tuple[float, str]]

Get relevant documents.

Parameters:
  • query (str) – Query for obtaining relevant documents.

  • n_neighbours (int) – Number of relevant documents to return.

Returns:

Relevant documents with their distances.

Return type:

list[tuple[float, str]]

class lab_4_retrieval_w_clustering.main.VectorDBTreeSearchEngine(db: DocumentVectorDB)

Bases: VectorDBEngine

Engine providing a unified interface to SearchEngine.

__init__(db: DocumentVectorDB) → None

Initialize an instance of the VectorDBTreeSearchEngine class.

Parameters:

db (DocumentVectorDB) – An instance of DocumentVectorDB class.

lab_4_retrieval_w_clustering.main.get_paragraphs(text: str) → list[str]

Split text into paragraphs.

Parameters:

text (str) – Text to split into paragraphs.

Raises:

ValueError – If the input argument has an inappropriate type or is empty.

Returns:

Paragraphs from document.

Return type:

list[str]
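The delimiter used to detect paragraphs is not stated in this reference. A minimal sketch, assuming that paragraphs are separated by newline characters and that blank lines are discarded (split_paragraphs is a hypothetical name):

```python
def split_paragraphs(text: str) -> list[str]:
    """Split text on newlines and drop empty or whitespace-only pieces."""
    if not isinstance(text, str) or not text:
        raise ValueError("text must be a non-empty string")
    # Assumption: one paragraph per line; surrounding whitespace is trimmed.
    return [piece.strip() for piece in text.split("\n") if piece.strip()]


paragraphs = split_paragraphs("First paragraph.\n\nSecond paragraph.\n  Third.  ")
```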