lab_3_ann_retriever package
Submodules
Lab 3.
Vector search with text retrieval
- class lab_3_ann_retriever.main.AdvancedSearchEngine(vectorizer: Vectorizer, tokenizer: Tokenizer)
Bases:
SearchEngine
Retriever based on KDTree algorithm.
- __init__(vectorizer: Vectorizer, tokenizer: Tokenizer) None
Initialize an instance of the AdvancedSearchEngine class.
- Parameters:
vectorizer (Vectorizer) – Vectorizer for documents vectorization
tokenizer (Tokenizer) – Tokenizer for tokenization
- class lab_3_ann_retriever.main.BasicSearchEngine(vectorizer: Vectorizer, tokenizer: Tokenizer)
Bases:
object
Engine based on KNN algorithm.
- __init__(vectorizer: Vectorizer, tokenizer: Tokenizer) None
Initialize an instance of the BasicSearchEngine class.
- Parameters:
vectorizer (Vectorizer) – Vectorizer for documents vectorization
tokenizer (Tokenizer) – Tokenizer for tokenization
- _calculate_knn(query_vector: tuple[float, ...], document_vectors: list[tuple[float, ...]], n_neighbours: int) list[tuple[int, float]] | None
Find nearest neighbours for a query vector.
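- Parameters:
query_vector (Vector) – Vectorized query
document_vectors (list[Vector]) – Vectorized documents
n_neighbours (int) – Number of nearest neighbours to return
- Returns:
Indices of the nearest neighbours and their distances
- Return type:
list[tuple[int, float]] | None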
In case of corrupt input arguments, None is returned.
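A brute-force nearest-neighbour search matching the signature above can be sketched as follows. This is a minimal illustration, not the lab's implementation: the name `knn_sketch` and the exact input checks are assumptions, and the distance is taken to be Euclidean, as in `calculate_distance`.

```python
import math


def knn_sketch(query_vector, document_vectors, n_neighbours):
    """Hypothetical helper: return (index, distance) pairs, closest first."""
    # Assumed validation; the real method may check its inputs differently.
    if not query_vector or not document_vectors or n_neighbours <= 0:
        return None
    # Measure the distance from the query to every document vector.
    distances = [
        (index, math.dist(query_vector, document_vector))
        for index, document_vector in enumerate(document_vectors)
    ]
    # Sort by distance and keep the n_neighbours closest entries.
    distances.sort(key=lambda pair: pair[1])
    return distances[:n_neighbours]


neighbours = knn_sketch((0.0, 0.0), [(3.0, 4.0), (1.0, 0.0), (0.0, 2.0)], 2)
```

Every query compares against all document vectors, which is the cost the tree-based engines below are meant to reduce.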
- _dump_documents() dict
Dump the document states needed to save the Engine.
- Returns:
States of the documents and document vectors
- Return type:
dict
- _index_document(document: str) tuple[float, ...] | None
Index document.
- Parameters:
document (str) – Document to index
- Returns:
Document vector
- Return type:
Vector | None
In case of corrupt input arguments, None is returned.
- _vectorizer: Vectorizer
- index_documents(documents: list[str]) bool
Index documents for the engine.
- Parameters:
documents (list[str]) – Documents to index
- Returns:
True if documents are successfully indexed
- Return type:
bool
In case of corrupt input arguments, False is returned.
- retrieve_relevant_documents(query: str, n_neighbours: int) list[tuple[float, str]] | None
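Retrieve the documents most relevant to the query.
- Parameters:
query (str) – Query to search for
n_neighbours (int) – Number of relevant documents to return
- Returns:
Relevant documents with their distances
- Return type:
list[tuple[float, str]] | None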
In case of corrupt input arguments, None is returned.
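The signature of `retrieve_relevant_documents` suggests a pipeline of vectorizing the query, measuring its distance to each indexed document, and returning `(distance, document)` pairs. The sketch below illustrates that flow under loud assumptions: `retrieve_sketch` is a hypothetical name, whitespace splitting stands in for the `Tokenizer`, and raw term counts stand in for the TF-IDF `Vectorizer`.

```python
import math


def retrieve_sketch(query, documents, n_neighbours):
    """Hypothetical helper: rank documents by distance to the query."""
    # Assumed validation; the real method may check its inputs differently.
    if not isinstance(query, str) or not query or n_neighbours <= 0:
        return None
    # Assumption: whitespace tokenization and a sorted shared vocabulary.
    vocabulary = sorted({token for doc in documents for token in doc.lower().split()})

    def to_vector(text):
        tokens = text.lower().split()
        # Assumption: raw term counts instead of TF-IDF weights.
        return tuple(float(tokens.count(token)) for token in vocabulary)

    query_vector = to_vector(query)
    scored = [(math.dist(query_vector, to_vector(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0])
    return scored[:n_neighbours]


results = retrieve_sketch("red apple", ["red apple", "green pear", "red red apple"], 2)
```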
- class lab_3_ann_retriever.main.KDTree
Bases:
NaiveKDTree
KDTree.
- _find_closest(vector: tuple[float, ...], k: int = 1) list[tuple[float, int]] | None
Get k nearest neighbours for vector by filling best list.
- Parameters:
vector (Vector) – Vector for getting knn
k (int) – The number of nearest neighbours to return
- Returns:
The list of k nearest neighbours
- Return type:
list[tuple[float, int]] | None
In case of corrupt input arguments, None is returned.
- class lab_3_ann_retriever.main.NaiveKDTree
Bases:
object
NaiveKDTree.
- _find_closest(vector: tuple[float, ...], k: int = 1) list[tuple[float, int]] | None
Get k nearest neighbours for vector by filling best list.
- Parameters:
vector (Vector) – Vector for getting knn
k (int) – The number of nearest neighbours to return
- Returns:
The list of k nearest neighbours
- Return type:
list[tuple[float, int]] | None
In case of corrupt input arguments, None is returned.
- build(vectors: list[tuple[float, ...]]) bool
Build tree.
- Parameters:
vectors (list[Vector]) – Vectors for tree building
- Returns:
True if the tree was built, False otherwise
- Return type:
bool
In case of corrupt input arguments, False is returned.
- query(vector: tuple[float, ...], k: int = 1) list[tuple[float, int]] | None
Get k nearest neighbours for vector.
- Parameters:
vector (Vector) – Vector to get k nearest neighbours
k (int) – Number of nearest neighbours to get
- Returns:
Distances to the nearest neighbours and their indices
- Return type:
list[tuple[float, int]] | None
In case of corrupt input arguments, None is returned.
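The `build` and `query` pair above follows the usual k-d tree pattern, which can be sketched as below. This is a generic textbook version under stated assumptions, not the lab's code: `SketchNode`, `build_sketch`, and `query_sketch` are hypothetical names, the split is at the median of one axis per level, and the search backtracks into the far subtree only when it could still hold a closer point. Whether `NaiveKDTree` backtracks at all is not stated in this reference.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchNode:
    """Hypothetical node: a vector, its index in the input, two children."""
    vector: tuple
    payload: int
    left: Optional["SketchNode"] = None
    right: Optional["SketchNode"] = None


def build_sketch(vectors, indices=None, depth=0):
    """Recursively split the points at the median of one axis per level."""
    if indices is None:
        indices = list(range(len(vectors)))
    if not indices:
        return None
    axis = depth % len(vectors[0])
    ordered = sorted(indices, key=lambda i: vectors[i][axis])
    median = len(ordered) // 2
    return SketchNode(
        vector=vectors[ordered[median]],
        payload=ordered[median],
        left=build_sketch(vectors, ordered[:median], depth + 1),
        right=build_sketch(vectors, ordered[median + 1:], depth + 1),
    )


def query_sketch(root, vector, k=1):
    """Return up to k (distance, payload) pairs, closest first."""
    best = []  # kept sorted ascending and trimmed to at most k entries

    def visit(node, depth):
        if node is None:
            return
        best.append((math.dist(vector, node.vector), node.payload))
        best.sort()
        del best[k:]
        axis = depth % len(vector)
        diff = vector[axis] - node.vector[axis]
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        visit(near, depth + 1)
        # Visit the far side only if the splitting plane is closer than
        # the current k-th best distance, or fewer than k points are known.
        if len(best) < k or abs(diff) < best[-1][0]:
            visit(far, depth + 1)

    visit(root, 0)
    return best


points = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build_sketch(points)
closest = query_sketch(tree, (9.0, 2.0), k=2)
```

The result uses the same `(distance, index)` ordering as the documented return type, so it can be checked directly against a brute-force scan.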
- class lab_3_ann_retriever.main.Node(vector: tuple[float, ...] = (), payload: int = -1, left_node: NodeLike | None = None, right_node: NodeLike | None = None)
Bases:
NodeLike
Interface definition for Node for KDTree.
- __init__(vector: tuple[float, ...] = (), payload: int = -1, left_node: NodeLike | None = None, right_node: NodeLike | None = None) None
Initialize an instance of the Node class.
- class lab_3_ann_retriever.main.NodeLike(*args, **kwargs)
Bases:
Protocol
Type alias for a tree node.
- __init__(*args, **kwargs)
- class lab_3_ann_retriever.main.SearchEngine(vectorizer: Vectorizer, tokenizer: Tokenizer)
Bases:
BasicSearchEngine
Retriever based on KDTree algorithm.
- __init__(vectorizer: Vectorizer, tokenizer: Tokenizer) None
Initialize an instance of the SearchEngine class.
- Parameters:
vectorizer (Vectorizer) – Vectorizer for documents vectorization
tokenizer (Tokenizer) – Tokenizer for tokenization
- _tree: NaiveKDTree
- index_documents(documents: list[str]) bool
Index documents for the retriever.
- Parameters:
documents (list[str]) – Documents to index
- Returns:
True if documents are successfully indexed
- Return type:
bool
In case of corrupt input arguments, False is returned.
- class lab_3_ann_retriever.main.Tokenizer(stop_words: list[str])
Bases:
object
Tokenizer that removes stop words.
- _remove_stop_words(tokens: list[str]) list[str] | None
Remove stopwords from the list of tokens.
- Parameters:
tokens (list[str]) – Tokens to filter
- Returns:
Tokens after removing stopwords
- Return type:
list[str] | None
In case of corrupt input arguments, None is returned.
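Stop-word removal of this kind is a simple filter. The following sketch shows one way to do it; `remove_stop_words_sketch` is a hypothetical free function (the documented method reads the stop words from the `Tokenizer` instance), and the input check is an assumption about what counts as corrupt input.

```python
def remove_stop_words_sketch(tokens, stop_words):
    """Hypothetical helper: drop every token that is a stop word."""
    # Assumed validation: tokens must be a list of strings.
    if not isinstance(tokens, list) or not all(isinstance(t, str) for t in tokens):
        return None
    # A set makes each membership test constant time on average.
    stop_set = set(stop_words)
    return [token for token in tokens if token not in stop_set]


filtered = remove_stop_words_sketch(["the", "cat", "and", "the", "hat"], ["the", "and"])
```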
- lab_3_ann_retriever.main.Vector
Type alias for vector representation of a text.
- class lab_3_ann_retriever.main.Vectorizer(corpus: list[list[str]])
Bases:
object
TF-IDF Vectorizer.
- _calculate_tf_idf(document: list[str]) tuple[float, ...] | None
Get TF-IDF for document.
- Parameters:
document (list[str]) – Tokenized document
- Returns:
TF-IDF vector for document
- Return type:
Vector | None
In case of corrupt input arguments, None is returned.
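For orientation, a common TF-IDF formulation is sketched below. The exact weighting the lab uses is not given in this reference, so the formulas are assumptions: term frequency as count divided by document length, and inverse document frequency as the natural log of corpus size over document frequency. `tf_idf_sketch` is a hypothetical name.

```python
import math


def tf_idf_sketch(document, corpus):
    """Hypothetical helper: TF-IDF vector over the sorted corpus vocabulary."""
    vocabulary = sorted({token for doc in corpus for token in doc})
    n_docs = len(corpus)
    vector = []
    for token in vocabulary:
        # Assumed TF: share of the document taken up by this token.
        tf = document.count(token) / len(document) if document else 0.0
        # Assumed IDF: log of corpus size over the number of documents
        # containing the token (at least 1, since the token is in the corpus).
        df = sum(1 for doc in corpus if token in doc)
        vector.append(tf * math.log(n_docs / df))
    return tuple(vector)


corpus = [["cat", "sat"], ["cat", "ran"]]
weights = tf_idf_sketch(["cat", "sat"], corpus)
```

Under these assumptions a token that occurs in every document, such as "cat" here, gets weight zero.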
- build() bool
Build vocabulary with tokenized_documents.
- Returns:
True if built successfully, False otherwise
- Return type:
bool
- load(file_path: str) bool
Load the Vectorizer state from file.
- Parameters:
file_path (str) – The path to the file from which to load the instance
- Returns:
True if the vectorizer was loaded successfully
- Return type:
bool
In case of corrupt input arguments, False is returned.
- vector2tokens(vector: tuple[float, ...]) list[str] | None
Recreate a tokenized document based on a vector.
- Parameters:
vector (Vector) – Vector to decode
- Returns:
Tokenized document
- Return type:
list[str] | None
In case of corrupt input arguments, None is returned.
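One plausible reading of `vector2tokens` is that each vector position corresponds to a vocabulary token, and the tokens with non-zero weight are returned. The sketch below illustrates that reading only; `vector_to_tokens_sketch` and its explicit `vocabulary` argument are hypothetical, and the real method may order or select tokens differently.

```python
def vector_to_tokens_sketch(vector, vocabulary):
    """Hypothetical helper: list the tokens whose weight is non-zero."""
    # Assumed validation: one weight per vocabulary token.
    if len(vector) != len(vocabulary):
        return None
    return [token for token, weight in zip(vocabulary, vector) if weight != 0.0]


tokens = vector_to_tokens_sketch((0.0, 0.35, 0.0, 0.12), ["cat", "hat", "ran", "sat"])
```

Token order and repetitions in the original document cannot be recovered this way, only which vocabulary tokens were present.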
- vectorize(tokenized_document: list[str]) tuple[float, ...] | None
Create a vector for tokenized document.
- Parameters:
tokenized_document (list[str]) – Tokenized document to vectorize
- Returns:
TF-IDF vector for document
- Return type:
Vector | None
In case of corrupt input arguments, None is returned.
- lab_3_ann_retriever.main.calculate_distance(query_vector: tuple[float, ...], document_vector: tuple[float, ...]) float | None
Calculate Euclidean distance between a query vector and a document vector.
- Parameters:
query_vector (Vector) – Vectorized query
document_vector (Vector) – Vectorized document
- Returns:
Euclidean distance between the vectors
- Return type:
float | None
In case of corrupt input arguments, None is returned.
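The Euclidean distance between two vectors is the square root of the sum of squared coordinate differences. A minimal sketch follows; `euclidean_distance_sketch` is a hypothetical name, and treating non-tuple or mismatched-length inputs as corrupt is an assumption about the documented None case.

```python
import math


def euclidean_distance_sketch(query_vector, document_vector):
    """Hypothetical helper: Euclidean distance, or None on bad input."""
    # Assumed validation: both inputs are tuples of equal length.
    if not isinstance(query_vector, tuple) or not isinstance(document_vector, tuple):
        return None
    if len(query_vector) != len(document_vector):
        return None
    squared = sum((q - d) ** 2 for q, d in zip(query_vector, document_vector))
    return math.sqrt(squared)


distance = euclidean_distance_sketch((0.0, 0.0), (3.0, 4.0))
```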
- lab_3_ann_retriever.main.load_vector(state: dict) tuple[float, ...] | None
Load vector from state.
- Parameters:
state (dict) – State of the vector to load from
- Returns:
Loaded vector
- Return type:
Vector | None
In case of corrupt input arguments, None is returned.
- lab_3_ann_retriever.main.save_vector(vector: tuple[float, ...]) dict
Prepare a vector for saving.
- Parameters:
vector (Vector) – Vector to save
- Returns:
State of the vector to save
- Return type:
dict
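The layout of the state dictionary is not specified in this reference. As an illustration only, the sketch below assumes a sparse layout with a hypothetical `"len"` key for the vector length and a hypothetical `"elements"` key mapping indices to non-zero values; `save_vector_sketch` and `load_vector_sketch` are likewise hypothetical names.

```python
import json


def save_vector_sketch(vector):
    """Hypothetical helper: keep only the non-zero values and the length."""
    return {
        "len": len(vector),
        "elements": {index: value for index, value in enumerate(vector) if value != 0.0},
    }


def load_vector_sketch(state):
    """Hypothetical helper: rebuild the dense tuple from the assumed state."""
    if not isinstance(state, dict) or "len" not in state or "elements" not in state:
        return None
    values = [0.0] * state["len"]
    for index, value in state["elements"].items():
        # JSON turns integer keys into strings, so convert them back.
        values[int(index)] = value
    return tuple(values)


vector = (0.0, 0.5, 0.0, 1.25)
restored = load_vector_sketch(json.loads(json.dumps(save_vector_sketch(vector))))
```

A sparse layout like this suits TF-IDF vectors, where most positions are typically zero.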
Laboratory Work #3 starter.