lab_3_ann_retriever package

Submodules

Lab 3.

Vector search with text retrieval

class lab_3_ann_retriever.main.AdvancedSearchEngine(vectorizer: Vectorizer, tokenizer: Tokenizer)

Bases: SearchEngine

Retriever based on KDTree algorithm.

__init__(vectorizer: Vectorizer, tokenizer: Tokenizer) → None

Initialize an instance of the AdvancedSearchEngine class.

Parameters:

vectorizer (Vectorizer) – Vectorizer for documents vectorization
tokenizer (Tokenizer) – Tokenizer for tokenization

_tree: KDTree

class lab_3_ann_retriever.main.BasicSearchEngine(vectorizer: Vectorizer, tokenizer: Tokenizer)

Bases: object

Engine based on KNN algorithm.

__init__(vectorizer: Vectorizer, tokenizer: Tokenizer) → None

Initialize an instance of the BasicSearchEngine class.

Parameters:

vectorizer (Vectorizer) – Vectorizer for documents vectorization
tokenizer (Tokenizer) – Tokenizer for tokenization

_calculate_knn(query_vector: tuple[float, ...], document_vectors: list[tuple[float, ...]], n_neighbours: int) → list[tuple[int, float]] | None

Find nearest neighbours for a query vector.

Parameters:

query_vector (Vector) – Vectorized query
document_vectors (list[Vector]) – Vectorized documents
n_neighbours (int) – Number of neighbours to return

Returns:

Nearest neighbours indices and distances

Return type:

list[tuple[int, float]] | None

In case of corrupt input arguments, None is returned.

_document_vectors: list[tuple[float, ...]]

_documents: list[str]

_dump_documents() → dict

Dump document states for saving the engine.

Returns:: document and document_vectors states
Return type:: dict

_index_document(document: str) → tuple[float, ...] | None

Index document.

Parameters:: document (str) – Document to index
Returns:: Document vector
Return type:: Vector | None

In case of corrupt input arguments, None is returned.

_load_documents(state: dict) → bool

Load documents from state.

Parameters:: state (dict) – state with documents
Returns:: True if documents were loaded, False in other cases
Return type:: bool

_tokenizer: Tokenizer

_vectorizer: Vectorizer

index_documents(documents: list[str]) → bool

Index documents for engine.

Parameters:: documents (list[str]) – Documents to index
Returns:: True if documents are successfully indexed
Return type:: bool

In case of corrupt input arguments, False is returned.

load(file_path: str) → bool

Load engine from state.

Parameters:: file_path (str) – The path to the file with state
Returns:: True if engine was loaded, False in other cases
Return type:: bool

retrieve_relevant_documents(query: str, n_neighbours: int) → list[tuple[float, str]] | None

Retrieve the documents most relevant to a query.

Parameters:

query (str) – Query for obtaining relevant documents
n_neighbours (int) – Number of relevant documents to return

Returns:

Relevant documents with their distances

Return type:

list[tuple[float, str]] | None

In case of corrupt input arguments, None is returned.

retrieve_vectorized(query_vector: tuple[float, ...]) → str | None

Retrieve document by vector.

Parameters:: query_vector (Vector) – Question vector
Returns:: Answer document
Return type:: str | None

In case of corrupt input arguments, None is returned.

save(file_path: str) → bool

Save the engine state to a file.

Parameters:: file_path (str) – The path to the file where to save the instance
Returns:: True if saved successfully, False otherwise
Return type:: bool

class lab_3_ann_retriever.main.KDTree

Bases: NaiveKDTree

KDTree.

_find_closest(vector: tuple[float, ...], k: int = 1) → list[tuple[float, int]] | None

Get k nearest neighbours for vector by filling best list.

Parameters:

vector (Vector) – Vector for getting knn
k (int) – The number of nearest neighbours to return

Returns:

The list of k nearest neighbours

Return type:

list[tuple[float, int]] | None

In case of corrupt input arguments, None is returned.

class lab_3_ann_retriever.main.NaiveKDTree

Bases: object

NaiveKDTree.

__init__() → None: Initialize an instance of the KDTree class.

_find_closest(vector: tuple[float, ...], k: int = 1) → list[tuple[float, int]] | None

Get k nearest neighbours for vector by filling best list.

Parameters:

vector (Vector) – Vector for getting knn
k (int) – The number of nearest neighbours to return

Returns:

The list of k nearest neighbours

Return type:

list[tuple[float, int]] | None

In case of corrupt input arguments, None is returned.

_root: NodeLike | None

build(vectors: list[tuple[float, ...]]) → bool

Build tree.

Parameters:: vectors (list[Vector]) – Vectors for tree building
Returns:: True if tree was built, False in other cases
Return type:: bool

In case of corrupt input arguments, False is returned.
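
One common way to build a KD-tree is to split on the median along a cycling axis. The sketch below assumes that approach; SketchNode and build_sketch are hypothetical names, and the lab may build its tree differently (for example iteratively rather than recursively).

```python
from dataclasses import dataclass


@dataclass
class SketchNode:
    """Stand-in for a tree node: a vector, its original index, two children."""
    vector: tuple[float, ...]
    payload: int
    left_node: "SketchNode | None" = None
    right_node: "SketchNode | None" = None


def build_sketch(vectors: list[tuple[float, ...]]) -> "SketchNode | None":
    """Build a KD-tree by recursive median splits; None for empty input."""

    def build(items, depth):
        if not items:
            return None
        # Cycle through the dimensions as the tree gets deeper.
        axis = depth % len(items[0][0])
        items = sorted(items, key=lambda item: item[0][axis])
        median = len(items) // 2
        vector, payload = items[median]
        return SketchNode(
            vector,
            payload,
            build(items[:median], depth + 1),
            build(items[median + 1:], depth + 1),
        )

    if not vectors:
        return None
    # Pair each vector with its position so the payload survives sorting.
    return build([(vector, index) for index, vector in enumerate(vectors)], 0)


root = build_sketch(
    [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
)
print(root.vector, root.payload)  # (7.0, 2.0) 5
```

The payload keeps the index of the vector in the input list, which lets a search result be mapped back to the original document.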

load(state: dict) → bool

Load NaiveKDTree instance from state.

Parameters:: state (dict) – saved state of the NaiveKDTree
Returns:: True if loaded successfully, False in other cases
Return type:: bool

query(vector: tuple[float, ...], k: int = 1) → list[tuple[float, int]] | None

Get k nearest neighbours for vector.

Parameters:

vector (Vector) – Vector to get k nearest neighbours
k (int) – Number of nearest neighbours to get

Returns:

Nearest neighbours as (distance, index) pairs

Return type:

list[tuple[float, int]] | None

In case of corrupt input arguments, None is returned.
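
A k-nearest-neighbour query over a KD-tree can be sketched as below. The nested-dict tree, the helper build and the function query_sketch are hypothetical; the pruning rule is the standard one, and nothing here is confirmed to match how NaiveKDTree or KDTree search internally.

```python
import math


def build(points, depth=0):
    """Build a tiny KD-tree as nested dicts from (vector, index) pairs."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda item: item[0][axis])
    mid = len(points) // 2
    return {
        "vector": points[mid][0],
        "payload": points[mid][1],
        "left": build(points[:mid], depth + 1),
        "right": build(points[mid + 1:], depth + 1),
    }


def query_sketch(root, vector, k=1):
    """Return up to k (distance, index) pairs, closest first."""
    if root is None or not vector or k <= 0:
        return None
    best = []  # kept sorted by distance, never longer than k

    def visit(node, depth):
        if node is None:
            return
        best.append((math.dist(vector, node["vector"]), node["payload"]))
        best.sort()
        del best[k:]
        axis = depth % len(vector)
        diff = vector[axis] - node["vector"][axis]
        near, far = ("left", "right") if diff < 0 else ("right", "left")
        visit(node[near], depth + 1)
        # The far side can only help if the splitting plane is closer
        # than the worst distance currently kept.
        if len(best) < k or abs(diff) < best[-1][0]:
            visit(node[far], depth + 1)

    visit(root, 0)
    return best


vectors = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build([(vector, index) for index, vector in enumerate(vectors)])
result = query_sketch(tree, (9.0, 2.0), k=2)
print([index for _, index in result])  # [4, 5]
```

The result order follows the documented return type, with the distance first and the vector index second.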

save() → dict | None

Save NaiveKDTree instance to state.

Returns:: state of the NaiveKDTree instance
Return type:: dict | None

In case of corrupt input arguments, None is returned.

class lab_3_ann_retriever.main.Node(vector: tuple[float, ...] = (), payload: int = -1, left_node: NodeLike | None = None, right_node: NodeLike | None = None)

Bases: NodeLike

Interface definition for Node for KDTree.

__init__(vector: tuple[float, ...] = (), payload: int = -1, left_node: NodeLike | None = None, right_node: NodeLike | None = None) → None

Initialize an instance of the Node class.

Parameters:

vector (Vector) – Current vector node
payload (int) – Index of current vector
left_node (NodeLike | None) – Left node
right_node (NodeLike | None) – Right node

left_node: NodeLike | None

load(state: dict[str, dict | int]) → bool

Load Node instance from state.

Parameters:: state (dict[str, dict | int]) – Saved state of the Node
Returns:: True if Node was loaded successfully, False in other cases.
Return type:: bool

payload: int

right_node: NodeLike | None

save() → dict

Save Node instance to state.

Returns:: state of the Node instance
Return type:: dict
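
Saving and loading a node is naturally recursive, since each node owns its children. The sketch below assumes a state layout with the keys "vector", "payload", "left_node" and "right_node"; that layout and the class name NodeSketch are illustrative guesses, not the format the lab actually uses.

```python
class NodeSketch:
    """Minimal node with recursive save/load (assumed state layout)."""

    def __init__(self, vector=(), payload=-1, left_node=None, right_node=None):
        self.vector = vector
        self.payload = payload
        self.left_node = left_node
        self.right_node = right_node

    def save(self) -> dict:
        """Serialize this node and, recursively, its children."""
        return {
            "vector": list(self.vector),
            "payload": self.payload,
            "left_node": self.left_node.save() if self.left_node else None,
            "right_node": self.right_node.save() if self.right_node else None,
        }

    def load(self, state: dict) -> bool:
        """Restore this node from a state; False if keys are missing."""
        required = {"vector", "payload", "left_node", "right_node"}
        if not isinstance(state, dict) or not required.issubset(state):
            return False
        self.vector = tuple(state["vector"])
        self.payload = state["payload"]
        for side in ("left_node", "right_node"):
            child = None
            if state[side] is not None:
                child = NodeSketch()
                if not child.load(state[side]):
                    return False
            setattr(self, side, child)
        return True


tree = NodeSketch((1.0, 2.0), 0, NodeSketch((0.0, 1.0), 1))
restored = NodeSketch()
ok = restored.load(tree.save())
print(ok, restored.left_node.vector)  # True (0.0, 1.0)
```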

vector: tuple[float, ...]

class lab_3_ann_retriever.main.NodeLike(*args, **kwargs)

Bases: Protocol

Protocol describing a tree node.

__init__(*args, **kwargs)

load(state: dict) → bool

Load Node instance from state.

Parameters:: state (dict) – Saved state of the Node
Returns:: True if Node was loaded successfully, False in other cases
Return type:: bool

save() → dict

Save Node instance to state.

Returns:: State of the Node instance
Return type:: dict
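
Because NodeLike is a Protocol, any class with matching save and load methods satisfies it structurally. The sketch below illustrates this with hypothetical names (NodeLikeSketch, LeafSketch); the runtime_checkable decorator is added only so the match can be shown with isinstance and is not confirmed to be used in the lab.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class NodeLikeSketch(Protocol):
    """Anything that can save itself to a dict and load itself back."""

    def save(self) -> dict: ...

    def load(self, state: dict) -> bool: ...


class LeafSketch:
    """Does not inherit from the protocol, yet matches it structurally."""

    def __init__(self) -> None:
        self.payload = -1

    def save(self) -> dict:
        return {"payload": self.payload}

    def load(self, state: dict) -> bool:
        if "payload" not in state:
            return False
        self.payload = state["payload"]
        return True


leaf = LeafSketch()
print(isinstance(leaf, NodeLikeSketch))  # True
```

At runtime, isinstance only checks that the methods exist; signatures are verified by a static type checker.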

class lab_3_ann_retriever.main.SearchEngine(vectorizer: Vectorizer, tokenizer: Tokenizer)

Bases: BasicSearchEngine

Retriever based on KDTree algorithm.

__init__(vectorizer: Vectorizer, tokenizer: Tokenizer) → None

Initialize an instance of the SearchEngine class.

Parameters:

vectorizer (Vectorizer) – Vectorizer for documents vectorization
tokenizer (Tokenizer) – Tokenizer for tokenization

_tree: NaiveKDTree

index_documents(documents: list[str]) → bool

Index documents for retriever.

Parameters:: documents (list[str]) – Documents to index
Returns:: True if documents are successfully indexed
Return type:: bool

In case of corrupt input arguments, False is returned.

load(file_path: str) → bool

Load a SearchEngine instance from a file.

Parameters:: file_path (str) – The path to the file from which to load the instance
Returns:: True if engine was loaded successfully, False in other cases
Return type:: bool

retrieve_relevant_documents(query: str, n_neighbours: int = 1) → list[tuple[float, str]] | None

Retrieve the documents most relevant to a query.

Parameters:

query (str) – Query for obtaining relevant documents.
n_neighbours (int) – Number of relevant documents to return.

Returns:

Relevant documents with their distances.

Return type:

list[tuple[float, str]] | None

In case of corrupt input arguments, None is returned.

save(file_path: str) → bool

Save the SearchEngine instance to a file.

Parameters:: file_path (str) – The path to the file where the instance should be saved
Returns:: True if saved successfully, False otherwise
Return type:: bool

class lab_3_ann_retriever.main.Tokenizer(stop_words: list[str])

Bases: object

Tokenizer that removes stop words.

__init__(stop_words: list[str]) → None

Initialize an instance of the Tokenizer class.

Parameters:: stop_words (list[str]) – List with stop words

_remove_stop_words(tokens: list[str]) → list[str] | None

Remove stopwords from the list of tokens.

Parameters:: tokens (list[str]) – List of tokens
Returns:: Tokens after removing stopwords
Return type:: list[str] | None

In case of corrupt input arguments, None is returned.

_stop_words: list[str]

tokenize(text: str) → list[str] | None

Tokenize the input text into lowercase words without punctuation, digits and other symbols.

Parameters:: text (str) – The input text to tokenize
Returns:: A list of words from the text
Return type:: list[str] | None

In case of corrupt input arguments, None is returned.

tokenize_documents(documents: list[str]) → list[list[str]] | None

Tokenize the input documents.

Parameters:: documents (list[str]) – Documents to tokenize
Returns:: A list of tokenized documents
Return type:: list[list[str]] | None

In case of corrupt input arguments, None is returned.

lab_3_ann_retriever.main.Vector: Type alias for vector representation of a text.

class lab_3_ann_retriever.main.Vectorizer(corpus: list[list[str]])

Bases: object

TF-IDF Vectorizer.

__init__(corpus: list[list[str]]) → None

Initialize an instance of the Vectorizer class.

Parameters:: corpus (list[list[str]]) – Tokenized documents to vectorize

_calculate_tf_idf(document: list[str]) → tuple[float, ...] | None

Get TF-IDF for document.

Parameters:: document (list[str]) – Tokenized document to vectorize
Returns:: TF-IDF vector for document
Return type:: Vector | None

In case of corrupt input arguments, None is returned.

_corpus: list[list[str]]

_idf_values: dict[str, float]

_token2ind: dict[str, int]

_vocabulary: list[str]

build() → bool

Build vocabulary with tokenized_documents.

Returns:: True if built successfully, False otherwise
Return type:: bool

load(file_path: str) → bool

Load the Vectorizer state from a file.

Parameters:: file_path (str) – The path to the file from which to load the instance
Returns:: True if the vectorizer was loaded successfully
Return type:: bool

In case of corrupt input arguments, False is returned.

save(file_path: str) → bool

Save the Vectorizer state to file.

Parameters:: file_path (str) – The path to the file where the instance should be saved
Returns:: True if saved successfully, False otherwise
Return type:: bool

vector2tokens(vector: tuple[float, ...]) → list[str] | None

Recreate a tokenized document based on a vector.

Parameters:: vector (Vector) – Vector to decode
Returns:: Tokenized document
Return type:: list[str] | None

In case of corrupt input arguments, None is returned.

vectorize(tokenized_document: list[str]) → tuple[float, ...] | None

Create a vector for tokenized document.

Parameters:: tokenized_document (list[str]) – Tokenized document to vectorize
Returns:: TF-IDF vector for document
Return type:: Vector | None

In case of corrupt input arguments, None is returned.

lab_3_ann_retriever.main.calculate_distance(query_vector: tuple[float, ...], document_vector: tuple[float, ...]) → float | None

Calculate Euclidean distance for a document vector.

Parameters:

query_vector (Vector) – Vectorized query
document_vector (Vector) – Vectorized document

Returns:

Euclidean distance for vector

Return type:

float | None

In case of corrupt input arguments, None is returned.

lab_3_ann_retriever.main.load_vector(state: dict) → tuple[float, ...] | None

Load vector from state.

Parameters:: state (dict) – State of the vector to load from
Returns:: Loaded vector
Return type:: Vector | None

In case of corrupt input arguments, None is returned.

lab_3_ann_retriever.main.save_vector(vector: tuple[float, ...]) → dict

Prepare a vector for saving.

Parameters:: vector (Vector) – Vector to save
Returns:: A state of the vector to save
Return type:: dict

Laboratory Work #3 starter.

lab_3_ann_retriever.start.main() → None: Launch an implementation.

lab_3_ann_retriever.start.open_files() → tuple[list[str], list[str]]

Open files.

Returns:: Documents and stopwords
Return type:: tuple[list[str], list[str]]