lab_3_ann_retriever package

Submodules

Lab 3.

Vector search with text retrieval

class lab_3_ann_retriever.main.AdvancedSearchEngine(vectorizer: Vectorizer, tokenizer: Tokenizer)

Bases: SearchEngine

Retriever based on KDTree algorithm.

__init__(vectorizer: Vectorizer, tokenizer: Tokenizer) None

Initialize an instance of the AdvancedSearchEngine class.

Parameters:
  • vectorizer (Vectorizer) – Vectorizer for documents vectorization

  • tokenizer (Tokenizer) – Tokenizer for tokenization

_tree: KDTree
class lab_3_ann_retriever.main.BasicSearchEngine(vectorizer: Vectorizer, tokenizer: Tokenizer)

Bases: object

Engine based on KNN algorithm.

__init__(vectorizer: Vectorizer, tokenizer: Tokenizer) None

Initialize an instance of the BasicSearchEngine class.

Parameters:
  • vectorizer (Vectorizer) – Vectorizer for documents vectorization

  • tokenizer (Tokenizer) – Tokenizer for tokenization

_calculate_knn(query_vector: tuple[float, ...], document_vectors: list[tuple[float, ...]], n_neighbours: int) list[tuple[int, float]] | None

Find nearest neighbours for a query vector.

Parameters:
  • query_vector (Vector) – Vectorized query

  • document_vectors (list[Vector]) – Vectorized documents

  • n_neighbours (int) – Number of neighbours to return

Returns:

Nearest neighbours indices and distances

Return type:

list[tuple[int, float]] | None

In case of corrupt input arguments, None is returned.

_document_vectors: list[tuple[float, ...]]
_documents: list[str]
_dump_documents() dict

Dump document states for saving the Engine.

Returns:

document and document_vectors states

Return type:

dict

_index_document(document: str) tuple[float, ...] | None

Index document.

Parameters:

document (str) – Document to index

Returns:

Document vector

Return type:

Vector | None

In case of corrupt input arguments, None is returned.

_load_documents(state: dict) bool

Load documents from state.

Parameters:

state (dict) – state with documents

Returns:

True if documents were loaded, False in other cases

Return type:

bool

_tokenizer: Tokenizer
_vectorizer: Vectorizer
index_documents(documents: list[str]) bool

Index documents for engine.

Parameters:

documents (list[str]) – Documents to index

Returns:

True if documents are successfully indexed

Return type:

bool

In case of corrupt input arguments, False is returned.

load(file_path: str) bool

Load engine from state.

Parameters:

file_path (str) – The path to the file with state

Returns:

True if engine was loaded, False in other cases

Return type:

bool

retrieve_relevant_documents(query: str, n_neighbours: int) list[tuple[float, str]] | None

Retrieve documents relevant to the query.

Parameters:
  • query (str) – Query for obtaining relevant documents

  • n_neighbours (int) – Number of relevant documents to return

Returns:

Relevant documents with their distances

Return type:

list[tuple[float, str]] | None

In case of corrupt input arguments, None is returned.

retrieve_vectorized(query_vector: tuple[float, ...]) str | None

Retrieve document by vector.

Parameters:

query_vector (Vector) – Question vector

Returns:

Answer document

Return type:

str | None

In case of corrupt input arguments, None is returned.

save(file_path: str) bool

Save the engine state to file.

Parameters:

file_path (str) – The path to the file where to save the instance

Returns:

True if the state was saved successfully, False otherwise

Return type:

bool

class lab_3_ann_retriever.main.KDTree

Bases: NaiveKDTree

KDTree.

_find_closest(vector: tuple[float, ...], k: int = 1) list[tuple[float, int]] | None

Get k nearest neighbours for vector by filling best list.

Parameters:
  • vector (Vector) – Vector for getting knn

  • k (int) – The number of nearest neighbours to return

Returns:

The list of k nearest neighbours

Return type:

list[tuple[float, int]] | None

In case of corrupt input arguments, None is returned.
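The "best list" search can be illustrated with a small, self-contained k-d tree. This is a sketch under stated assumptions: the tuple-based node layout, the helper names build and find_closest, and the pruning rule are illustrative choices, not the lab's code.

```python
import math


def build(items, depth=0):
    """Build a tree of (vector, payload, left, right) tuples by median split."""
    if not items:
        return None
    axis = depth % len(items[0][0])
    items = sorted(items, key=lambda item: item[0][axis])
    mid = len(items) // 2
    vector, payload = items[mid]
    return (
        vector,
        payload,
        build(items[:mid], depth + 1),
        build(items[mid + 1:], depth + 1),
    )


def find_closest(root, query, k=1):
    """Return up to k (distance, payload) pairs, closest first."""
    if root is None or not query or k <= 0:
        return None
    best = []  # sorted ascending by distance, never longer than k

    def visit(node, depth):
        if node is None:
            return
        vector, payload, left, right = node
        best.append((math.dist(query, vector), payload))
        best.sort()
        del best[k:]
        axis = depth % len(query)
        diff = query[axis] - vector[axis]
        # Descend first into the side of the splitting plane that contains the query.
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near, depth + 1)
        # Visit the other side only if it could still hold a closer point.
        if len(best) < k or abs(diff) < best[-1][0]:
            visit(far, depth + 1)

    visit(root, 0)
    return best
```

The pruning check is what separates this from a full scan: a subtree is skipped when the splitting plane is already farther away than the current k-th best distance.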

class lab_3_ann_retriever.main.NaiveKDTree

Bases: object

NaiveKDTree.

__init__() None

Initialize an instance of the NaiveKDTree class.

_find_closest(vector: tuple[float, ...], k: int = 1) list[tuple[float, int]] | None

Get k nearest neighbours for vector by filling best list.

Parameters:
  • vector (Vector) – Vector for getting knn

  • k (int) – The number of nearest neighbours to return

Returns:

The list of k nearest neighbours

Return type:

list[tuple[float, int]] | None

In case of corrupt input arguments, None is returned.

_root: NodeLike | None
build(vectors: list[tuple[float, ...]]) bool

Build tree.

Parameters:

vectors (list[Vector]) – Vectors for tree building

Returns:

True if tree was built, False in other cases

Return type:

bool

In case of corrupt input arguments, False is returned.
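One common way to build a k-d tree is to split on the median along an axis that cycles with depth. The sketch below assumes that strategy; SketchNode and build_kd_tree are hypothetical names, and the payload is taken to be the index of the vector in the input list, in line with the Node description in this module.

```python
from dataclasses import dataclass


@dataclass
class SketchNode:
    """Hypothetical stand-in for a tree node: a vector, its index, two children."""

    vector: tuple
    payload: int
    left: "SketchNode | None" = None
    right: "SketchNode | None" = None


def build_kd_tree(vectors):
    """Return the root of a median-split k-d tree, or None for empty input."""

    def build(items, depth):
        if not items:
            return None
        # Cycle through the dimensions as the tree gets deeper.
        axis = depth % len(items[0][0])
        items = sorted(items, key=lambda item: item[0][axis])
        median = len(items) // 2
        vector, payload = items[median]
        return SketchNode(
            vector,
            payload,
            build(items[:median], depth + 1),
            build(items[median + 1:], depth + 1),
        )

    if not vectors:
        return None
    # Pair every vector with its original index so it survives sorting.
    return build([(vector, index) for index, vector in enumerate(vectors)], 0)
```

Splitting on the median keeps the tree roughly balanced, so a lookup descends through about log2(n) levels before any backtracking.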

load(state: dict) bool

Load NaiveKDTree instance from state.

Parameters:

state (dict) – saved state of the NaiveKDTree

Returns:

True if loaded successfully, False in other cases

Return type:

bool

query(vector: tuple[float, ...], k: int = 1) list[tuple[float, int]] | None

Get k nearest neighbours for vector.

Parameters:
  • vector (Vector) – Vector to get k nearest neighbours

  • k (int) – Number of nearest neighbours to get

Returns:

Nearest neighbours' distances and indices

Return type:

list[tuple[float, int]] | None

In case of corrupt input arguments, None is returned.

save() dict | None

Save NaiveKDTree instance to state.

Returns:

state of the NaiveKDTree instance

Return type:

dict | None

In case of corrupt input arguments, None is returned.

class lab_3_ann_retriever.main.Node(vector: tuple[float, ...] = (), payload: int = -1, left_node: NodeLike | None = None, right_node: NodeLike | None = None)

Bases: NodeLike

Tree node for KDTree that implements the NodeLike interface.

__init__(vector: tuple[float, ...] = (), payload: int = -1, left_node: NodeLike | None = None, right_node: NodeLike | None = None) None

Initialize an instance of the Node class.

Parameters:
  • vector (Vector) – Current vector node

  • payload (int) – Index of current vector

  • left_node (NodeLike | None) – Left node

  • right_node (NodeLike | None) – Right node

left_node: NodeLike | None
load(state: dict[str, dict | int]) bool

Load Node instance from state.

Parameters:

state (dict[str, dict | int]) – Saved state of the Node

Returns:

True if Node was loaded successfully, False in other cases.

Return type:

bool

payload: int
right_node: NodeLike | None
save() dict

Save Node instance to state.

Returns:

state of the Node instance

Return type:

dict

vector: tuple[float, ...]
class lab_3_ann_retriever.main.NodeLike(*args, **kwargs)

Bases: Protocol

Interface definition (protocol) for a tree node.

__init__(*args, **kwargs)
load(state: dict) bool

Load Node instance from state.

Parameters:

state (dict) – Saved state of the Node

Returns:

True if Node was loaded successfully, False in other cases

Return type:

bool

save() dict

Save Node instance to state.

Returns:

State of the Node instance

Return type:

dict

class lab_3_ann_retriever.main.SearchEngine(vectorizer: Vectorizer, tokenizer: Tokenizer)

Bases: BasicSearchEngine

Retriever based on KDTree algorithm.

__init__(vectorizer: Vectorizer, tokenizer: Tokenizer) None

Initialize an instance of the SearchEngine class.

Parameters:
  • vectorizer (Vectorizer) – Vectorizer for documents vectorization

  • tokenizer (Tokenizer) – Tokenizer for tokenization

_tree: NaiveKDTree
index_documents(documents: list[str]) bool

Index documents for retriever.

Parameters:

documents (list[str]) – Documents to index

Returns:

True if documents are successfully indexed

Return type:

bool

In case of corrupt input arguments, False is returned.

load(file_path: str) bool

Load a SearchEngine instance from a file.

Parameters:

file_path (str) – The path to the file from which to load the instance

Returns:

True if engine was loaded successfully, False in other cases

Return type:

bool

retrieve_relevant_documents(query: str, n_neighbours: int = 1) list[tuple[float, str]] | None

Retrieve documents relevant to the query.

Parameters:
  • query (str) – Query for obtaining relevant documents.

  • n_neighbours (int) – Number of relevant documents to return.

Returns:

Relevant documents with their distances.

Return type:

list[tuple[float, str]] | None

In case of corrupt input arguments, None is returned.

save(file_path: str) bool

Save the SearchEngine instance to a file.

Parameters:

file_path (str) – The path to the file where the instance should be saved

Returns:

True if saved successfully, False otherwise

Return type:

bool

class lab_3_ann_retriever.main.Tokenizer(stop_words: list[str])

Bases: object

Tokenizer that removes stop words.

__init__(stop_words: list[str]) None

Initialize an instance of the Tokenizer class.

Parameters:

stop_words (list[str]) – List with stop words

_remove_stop_words(tokens: list[str]) list[str] | None

Remove stopwords from the list of tokens.

Parameters:

tokens (list[str]) – List of tokens

Returns:

Tokens after removing stopwords

Return type:

list[str] | None

In case of corrupt input arguments, None is returned.

_stop_words: list[str]
tokenize(text: str) list[str] | None

Tokenize the input text into lowercase words without punctuation, digits and other symbols.

Parameters:

text (str) – The input text to tokenize

Returns:

A list of words from the text

Return type:

list[str] | None

In case of corrupt input arguments, None is returned.

tokenize_documents(documents: list[str]) list[list[str]] | None

Tokenize the input documents.

Parameters:

documents (list[str]) – Documents to tokenize

Returns:

A list of tokenized documents

Return type:

list[list[str]] | None

In case of corrupt input arguments, None is returned.

lab_3_ann_retriever.main.Vector

Type alias for vector representation of a text.

class lab_3_ann_retriever.main.Vectorizer(corpus: list[list[str]])

Bases: object

TF-IDF Vectorizer.

__init__(corpus: list[list[str]]) None

Initialize an instance of the Vectorizer class.

Parameters:

corpus (list[list[str]]) – Tokenized documents to vectorize

_calculate_tf_idf(document: list[str]) tuple[float, ...] | None

Get TF-IDF for document.

Parameters:

document (list[str]) – Tokenized document to vectorize

Returns:

TF-IDF vector for document

Return type:

Vector | None

In case of corrupt input arguments, None is returned.

_corpus: list[list[str]]
_idf_values: dict[str, float]
_token2ind: dict[str, int]
_vocabulary: list[str]
build() bool

Build vocabulary with tokenized_documents.

Returns:

True if built successfully, False otherwise

Return type:

bool

load(file_path: str) bool

Load the Vectorizer state from file.

Parameters:

file_path (str) – The path to the file from which to load the instance

Returns:

True if the vectorizer was loaded successfully

Return type:

bool

In case of corrupt input arguments, False is returned.

save(file_path: str) bool

Save the Vectorizer state to file.

Parameters:

file_path (str) – The path to the file where the instance should be saved

Returns:

True if saved successfully, False otherwise

Return type:

bool

vector2tokens(vector: tuple[float, ...]) list[str] | None

Recreate a tokenized document based on a vector.

Parameters:

vector (Vector) – Vector to decode

Returns:

Tokenized document

Return type:

list[str] | None

In case of corrupt input arguments, None is returned.

vectorize(tokenized_document: list[str]) tuple[float, ...] | None

Create a vector for tokenized document.

Parameters:

tokenized_document (list[str]) – Tokenized document to vectorize

Returns:

TF-IDF vector for document

Return type:

Vector | None

In case of corrupt input arguments, None is returned.

lab_3_ann_retriever.main.calculate_distance(query_vector: tuple[float, ...], document_vector: tuple[float, ...]) float | None

Calculate Euclidean distance between a query vector and a document vector.

Parameters:
  • query_vector (Vector) – Vectorized query

  • document_vector (Vector) – Vectorized document

Returns:

Euclidean distance for vector

Return type:

float | None

In case of corrupt input arguments, None is returned.

lab_3_ann_retriever.main.load_vector(state: dict) tuple[float, ...] | None

Load vector from state.

Parameters:

state (dict) – State of the vector to load from

Returns:

Loaded vector

Return type:

Vector | None

In case of corrupt input arguments, None is returned.

lab_3_ann_retriever.main.save_vector(vector: tuple[float, ...]) dict

Prepare a vector for saving.

Parameters:

vector (Vector) – Vector to save

Returns:

A state of the vector to save

Return type:

dict

Laboratory Work #3 starter.

lab_3_ann_retriever.start.main() None

Launch an implementation.

lab_3_ann_retriever.start.open_files() tuple[list[str], list[str]]

Open files.

Returns:

Documents and stopwords

Return type:

tuple[list[str], list[str]]