lab_4_retrieval_w_clustering package
Submodules
Lab 4.
Vector search with clustering
- class lab_4_retrieval_w_clustering.main.BM25Vectorizer
Bases:
Vectorizer
BM25 Vectorizer.
- _calculate_bm25(tokenized_document: list[str]) tuple[float, ...]
Get BM25 vector for tokenized document.
- Parameters:
tokenized_document (list[str]) – Tokenized document to vectorize.
- Raises:
ValueError – If the input argument has an inappropriate type or is empty.
- Returns:
BM25 vector for document.
- Return type:
Vector
- set_tokenized_corpus(tokenized_corpus: list[list[str]]) None
Set tokenized corpus and average document length.
- Parameters:
tokenized_corpus (TokenizedCorpus) – Tokenized texts corpus.
- Raises:
ValueError – If the input argument has an inappropriate type or is empty.
- vectorize(tokenized_document: list[str]) tuple[float, ...]
Create a vector for tokenized document.
- Parameters:
tokenized_document (list[str]) – Tokenized document to vectorize.
- Raises:
ValueError – If the input argument has an inappropriate type or is empty, or if a called method returns None.
- Returns:
BM25 vector for document.
- Return type:
Vector
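For orientation, the following is a minimal sketch of an Okapi BM25 vector built over a sorted vocabulary. It is not the package's implementation: the function name `bm25_vector`, the values `k1=1.5` and `b=0.75`, the smoothed IDF form, and the choice to pass the corpus as an argument are all assumptions made for illustration.

```python
import math


def bm25_vector(
    tokenized_document: list[str],
    tokenized_corpus: list[list[str]],
    k1: float = 1.5,  # assumed value, not taken from the package
    b: float = 0.75,  # assumed value, not taken from the package
) -> tuple[float, ...]:
    """Return one BM25 weight per term of the sorted corpus vocabulary."""
    if not tokenized_document or not tokenized_corpus:
        raise ValueError("document and corpus must be non-empty")

    vocabulary = sorted({token for doc in tokenized_corpus for token in doc})
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs
    doc_len = len(tokenized_document)

    vector = []
    for term in vocabulary:
        tf = tokenized_document.count(term)
        if tf == 0:
            # Terms absent from the document get zero weight.
            vector.append(0.0)
            continue
        # Number of corpus documents that contain the term.
        df = sum(1 for doc in tokenized_corpus if term in doc)
        # Smoothed IDF (assumed variant); the "+ 1.0" keeps it positive.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        numerator = tf * (k1 + 1)
        denominator = tf + k1 * (1 - b + b * doc_len / avg_len)
        vector.append(idf * numerator / denominator)
    return tuple(vector)
```

In this sketch the vector has one position per vocabulary term, so documents from the same corpus always produce vectors of equal length.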
- class lab_4_retrieval_w_clustering.main.ClusterDTO(centroid_vector: tuple[float, ...])
Bases:
object
Store clusters.
- __init__(centroid_vector: tuple[float, ...]) None
Initialize an instance of the ClusterDTO class.
- Parameters:
centroid_vector (Vector) – Centroid vector.
- __len__() int
Return the number of document indices.
- Returns:
The number of document indices.
- Return type:
int
- add_document_index(index: int) None
Add document index.
- Parameters:
index (int) – Index of document.
- Raises:
ValueError – If the input argument has an inappropriate type or is empty.
- get_centroid() tuple[float, ...]
Get cluster centroid.
- Returns:
Centroid of current cluster.
- Return type:
Vector
- set_new_centroid(new_centroid: tuple[float, ...]) None
Set new centroid for cluster.
- Parameters:
new_centroid (Vector) – New centroid vector.
- Raises:
ValueError – If the input argument has an inappropriate type or is empty.
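A minimal sketch of a container with the same public methods as ClusterDTO may help clarify how the pieces fit together. The class name `ClusterSketch`, the private attribute names, and the decision to ignore duplicate indices are assumptions, not details confirmed by the package.

```python
Vector = tuple[float, ...]


class ClusterSketch:
    """Hypothetical stand-in for ClusterDTO; internals are assumptions."""

    def __init__(self, centroid_vector: Vector) -> None:
        self._centroid = centroid_vector
        self._indices: list[int] = []

    def __len__(self) -> int:
        # The length of a cluster is the number of stored document indices.
        return len(self._indices)

    def add_document_index(self, index: int) -> None:
        # bool is a subclass of int, so it is rejected explicitly.
        if not isinstance(index, int) or isinstance(index, bool) or index < 0:
            raise ValueError("index must be a non-negative integer")
        # Assumption: an index already present is not stored twice.
        if index not in self._indices:
            self._indices.append(index)

    def get_centroid(self) -> Vector:
        return self._centroid

    def set_new_centroid(self, new_centroid: Vector) -> None:
        if not isinstance(new_centroid, tuple) or not new_centroid:
            raise ValueError("centroid must be a non-empty tuple")
        self._centroid = new_centroid
```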
- class lab_4_retrieval_w_clustering.main.ClusteringSearchEngine(db: DocumentVectorDB, n_clusters: int = 3)
Bases:
object
Engine based on KMeans algorithm.
- __init__(db: DocumentVectorDB, n_clusters: int = 3) None
Initialize an instance of the ClusteringSearchEngine class.
- Parameters:
db (DocumentVectorDB) – An instance of DocumentVectorDB class.
n_clusters (int) – Number of clusters.
- _db: DocumentVectorDB
- calculate_square_sum() float
Get the total, over all clusters, of the squared distances from the vectors in each cluster to its centroid.
- Returns:
Sum of squared distances from the cluster vectors to their centroids.
- Return type:
float
- retrieve_relevant_documents(query: str, n_neighbours: int) list[tuple[float, str]]
Get relevant documents.
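- Parameters:
query (str) – Query for obtaining relevant documents.
n_neighbours (int) – Number of relevant documents to return.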
- Raises:
ValueError – If the input arguments have an inappropriate type, are empty, or are otherwise incorrect, or if a called method returns None.
- Returns:
Relevant documents with their distances.
- Return type:
list[tuple[float, str]]
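The signatures suggest that index-based results (as returned by KMeans.infer) are turned into text-based results here. A minimal sketch of that mapping follows; the helper name `attach_documents` and the plain list used in place of the database are hypothetical.

```python
def attach_documents(
    neighbours: list[tuple[float, int]],
    documents: list[str],
) -> list[tuple[float, str]]:
    """Replace each document index with the corresponding raw text."""
    result = []
    for distance, index in neighbours:
        if not 0 <= index < len(documents):
            raise ValueError(f"unknown document index: {index}")
        result.append((distance, documents[index]))
    return result
```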
- lab_4_retrieval_w_clustering.main.Corpus
Type alias for corpus of texts.
- class lab_4_retrieval_w_clustering.main.DocumentVectorDB(stop_words: list[str])
Bases:
object
Document and vector database.
- _vectorizer: BM25Vectorizer
- get_raw_documents(indices: tuple[int, ...] | None = None) list[str]
Get documents by indices.
- Parameters:
indices (tuple[int, ...] | None) – Indices of the documents to get.
- Raises:
ValueError – If the input argument has an inappropriate type.
- Returns:
List of documents.
- Return type:
Corpus
- get_tokenizer() Tokenizer
Get an object of the Tokenizer class.
- Returns:
Tokenizer class object.
- Return type:
Tokenizer
- get_vectorizer() BM25Vectorizer
Get an object of the BM25Vectorizer class.
- Returns:
BM25Vectorizer class object.
- Return type:
BM25Vectorizer
- get_vectors(indices: list[int] | None = None) list[tuple[int, tuple[float, ...]]]
Get document vectors by indices.
- put_corpus(corpus: list[str]) None
Fill documents and vectors based on corpus.
- Parameters:
corpus (Corpus) – Corpus of texts.
- Raises:
ValueError – If the input argument has an inappropriate type or is empty, or if a called method returns None.
- class lab_4_retrieval_w_clustering.main.KMeans(db: DocumentVectorDB, n_clusters: int)
Bases:
object
Train k-means algorithm.
- __clusters: list[ClusterDTO]
- __init__(db: DocumentVectorDB, n_clusters: int) None
Initialize an instance of the KMeans class.
- Parameters:
db (DocumentVectorDB) – An instance of DocumentVectorDB class.
n_clusters (int) – Number of clusters.
- _db: DocumentVectorDB
- _is_convergence_reached(new_clusters: list[ClusterDTO], threshold: float = 1e-07) bool
Check the convergence of centroids.
- Parameters:
new_clusters (list[ClusterDTO]) – Centroids after updating.
threshold (float) – Threshold for determining the distance correctness.
- Raises:
ValueError – If the input arguments have an inappropriate type or are empty, or if a called method returns None.
- Returns:
True if convergence is reached (the distance is within the threshold), False otherwise.
- Return type:
bool
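One common way to check convergence is to compare each old centroid with its updated counterpart. The sketch below assumes Euclidean distance, a strict `<` comparison, and plain centroid tuples instead of ClusterDTO objects; none of these details are confirmed by the package.

```python
import math

Vector = tuple[float, ...]


def is_convergence_reached(
    old_centroids: list[Vector],
    new_centroids: list[Vector],
    threshold: float = 1e-07,
) -> bool:
    """Return True if every centroid moved less than the threshold."""
    if not old_centroids or len(old_centroids) != len(new_centroids):
        raise ValueError("centroid lists must be non-empty and equally long")
    return all(
        math.dist(old, new) < threshold
        for old, new in zip(old_centroids, new_centroids)
    )
```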
- calculate_square_sum() float
Get the sum of squared distances from the vectors in each cluster to its centroid.
- Returns:
Sum of squared distances from the cluster vectors to their centroids.
- Return type:
float
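The quantity described here is the within-cluster sum of squares. A minimal sketch, assuming squared Euclidean distance and clusters passed as hypothetical `(centroid, member_vectors)` pairs rather than ClusterDTO objects:

```python
Vector = tuple[float, ...]


def calculate_square_sum(clusters: list[tuple[Vector, list[Vector]]]) -> float:
    """Sum squared Euclidean distances from member vectors to centroids."""
    total = 0.0
    for centroid, members in clusters:
        for vector in members:
            total += sum((c - v) ** 2 for c, v in zip(centroid, vector))
    return total
```

A lower value means the vectors sit closer to their centroids, which is why this sum is often compared across different numbers of clusters.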
- infer(query_vector: tuple[float, ...], n_neighbours: int) list[tuple[float, int]]
Launch clustering model inference.
- Parameters:
query_vector (Vector) – Vector of query for obtaining relevant documents.
n_neighbours (int) – Number of relevant documents to return.
- Raises:
ValueError – If the input arguments have an inappropriate type, are empty, or are otherwise incorrect, or if a called method returns None.
- Returns:
Distances to the relevant documents, each paired with the document index.
- Return type:
list[tuple[float, int]]
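Cluster-based inference is typically a two-step search: pick the centroid closest to the query, then rank only the documents in that cluster. The sketch below assumes Euclidean distance and hypothetical plain-list inputs (`centroids`, and `members` as per-cluster lists of `(document_index, vector)` pairs); the package may organise this differently.

```python
import math

Vector = tuple[float, ...]


def infer(
    query_vector: Vector,
    centroids: list[Vector],
    members: list[list[tuple[int, Vector]]],
    n_neighbours: int,
) -> list[tuple[float, int]]:
    """Return up to n_neighbours (distance, index) pairs, closest first."""
    if not query_vector or not centroids or len(centroids) != len(members):
        raise ValueError("query and clusters must be non-empty and consistent")
    if not isinstance(n_neighbours, int) or n_neighbours <= 0:
        raise ValueError("n_neighbours must be a positive integer")

    # Step 1: find the cluster whose centroid is closest to the query.
    nearest = min(
        range(len(centroids)),
        key=lambda i: math.dist(query_vector, centroids[i]),
    )
    # Step 2: rank only the documents assigned to that cluster.
    ranked = sorted(
        (math.dist(query_vector, vector), index)
        for index, vector in members[nearest]
    )
    return ranked[:n_neighbours]
```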
- run_single_train_iteration() list[ClusterDTO]
Run single train iteration.
- Raises:
ValueError – If a called method returns None.
- Returns:
List of clusters.
- Return type:
list[ClusterDTO]
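A single k-means iteration usually consists of an assignment step followed by a centroid update. The following sketch works on plain lists rather than ClusterDTO objects and assumes Euclidean distance; keeping the old centroid for an empty cluster is also an assumption.

```python
import math

Vector = tuple[float, ...]


def run_single_train_iteration(
    vectors: list[Vector],
    centroids: list[Vector],
) -> tuple[list[Vector], list[list[int]]]:
    """Assign vectors to the nearest centroid, then recompute centroids."""
    if not vectors or not centroids:
        raise ValueError("vectors and centroids must be non-empty")

    # Assignment step: each vector index goes to its nearest centroid.
    assignments: list[list[int]] = [[] for _ in centroids]
    for index, vector in enumerate(vectors):
        nearest = min(
            range(len(centroids)),
            key=lambda i: math.dist(vector, centroids[i]),
        )
        assignments[nearest].append(index)

    # Update step: each centroid becomes the mean of its assigned vectors.
    new_centroids: list[Vector] = []
    for centroid, indices in zip(centroids, assignments):
        if not indices:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(centroid)
            continue
        new_centroids.append(
            tuple(
                sum(vectors[i][d] for i in indices) / len(indices)
                for d in range(len(centroid))
            )
        )
    return new_centroids, assignments
```

Training would repeat this step until the centroids stop moving by more than the convergence threshold.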
- lab_4_retrieval_w_clustering.main.TokenizedCorpus
Type alias for tokenized texts.
- class lab_4_retrieval_w_clustering.main.VectorDBAdvancedSearchEngine(db: DocumentVectorDB)
Bases:
VectorDBEngine
Engine providing a unified interface to AdvancedSearchEngine.
- __init__(db: DocumentVectorDB) None
Initialize an instance of the VectorDBAdvancedSearchEngine class.
- Parameters:
db (DocumentVectorDB) – An instance of DocumentVectorDB class.
- class lab_4_retrieval_w_clustering.main.VectorDBEngine(db: DocumentVectorDB, engine: BasicSearchEngine)
Bases:
object
Engine wrapper that encapsulates different engines and provides a unified API to them.
- __init__(db: DocumentVectorDB, engine: BasicSearchEngine) None
Initialize an instance of the VectorDBEngine class.
- Parameters:
db (DocumentVectorDB) – An instance of DocumentVectorDB class.
engine (BasicSearchEngine) – A search engine.
- _db: DocumentVectorDB
- _engine: BasicSearchEngine
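The wrapper idea can be sketched as plain delegation: the wrapper stores a database and an engine, and forwards search calls to the engine. The names `SearchEngineLike` and `EngineWrapper` are hypothetical, and the assumption that the wrapped engine exposes `retrieve_relevant_documents` is not confirmed by this page.

```python
from typing import Protocol


class SearchEngineLike(Protocol):
    """Hypothetical interface that any wrapped engine is assumed to satisfy."""

    def retrieve_relevant_documents(
        self, query: str, n_neighbours: int
    ) -> list[tuple[float, str]]: ...


class EngineWrapper:
    """Hypothetical stand-in for VectorDBEngine using plain delegation."""

    def __init__(self, db: object, engine: SearchEngineLike) -> None:
        self._db = db
        self._engine = engine

    def retrieve_relevant_documents(
        self, query: str, n_neighbours: int
    ) -> list[tuple[float, str]]:
        # Callers use one API regardless of which engine sits underneath.
        return self._engine.retrieve_relevant_documents(query, n_neighbours)
```

With this design, subclasses only need to choose which engine to pass to the base initializer.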
- class lab_4_retrieval_w_clustering.main.VectorDBSearchEngine(db: DocumentVectorDB)
Bases:
BasicSearchEngine
Engine based on VectorDB.
- __init__(db: DocumentVectorDB) None
Initialize an instance of the VectorDBSearchEngine class.
- Parameters:
db (DocumentVectorDB) – Object of DocumentVectorDB class.
- _db: DocumentVectorDB
- class lab_4_retrieval_w_clustering.main.VectorDBTreeSearchEngine(db: DocumentVectorDB)
Bases:
VectorDBEngine
Engine providing a unified interface to SearchEngine.
- __init__(db: DocumentVectorDB) None
Initialize an instance of the VectorDBTreeSearchEngine class.
- Parameters:
db (DocumentVectorDB) – An instance of DocumentVectorDB class.