Laboratory work №6. Process raw data

Python competencies required to complete this tutorial:

  • working with external dependencies, going beyond Python standard library;

  • working with external modules, both local and downloaded from PyPI;

  • working with files: create/read/update;

  • applying basic cleaning techniques to the raw text: tokenization, lemmatization etc.;

  • extracting linguistic features from the raw text: part of speech, case, etc.;

  • collecting statistics from the text: POS frequency, grammatical structure extraction, etc.

Processing data breaks down into the following steps:

  1. Loading raw data.

  2. Cleaning the text.

  3. Extracting valuable information from the text such as parts of speech, lemmas, syntactic positions etc.

  4. Saving extracted information in the specified format.

  5. Aggregating saved linguistic information to answer questions about data.

As a part of the second milestone, you need to implement the processing logic as a pipeline.py module. When run as a standalone Python program, it should perform all the aforementioned stages.

During this assignment you will be working with the UD text description format (.conllu extension).

Note

UD (Universal Dependencies) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. All this annotation is usually stored in a format called CoNLL-U, which is a vertical, table-like format. More information about the structure of the CoNLL-U format is available on the dedicated page.

Executing pipeline

Example execution (Windows):

python pipeline.py

Expected result:

  1. N raw texts previously collected by scraper are processed.

  2. Each article has a processed version (or versions) saved in the tmp/articles directory.

An example tmp directory content for mark 4:

+-- 2024-2-level-ctlr
    +-- tmp
        +-- articles
            +-- 1_raw.txt <- the paper with the ID (from scraper.py run)
            +-- 1_meta.json <- the paper meta-information (from scraper.py run)
            +-- 1_cleaned.txt <- processed text with no punctuation (by pipeline.py run)

An example tmp directory content for mark 6:

+-- 2024-2-level-ctlr
    +-- tmp
        +-- articles
            +-- 1_raw.txt <- the paper with the ID (from scraper.py run)
            +-- 1_meta.json <- the paper meta-information (from scraper.py run)
            +-- 1_cleaned.txt <- processed text with no punctuation (by pipeline.py run)
            +-- 1_udpipe_conllu.conllu <- processed text in the UD format (by pipeline.py run)

An example tmp directory content for marks 8 and 10:

+-- 2024-2-level-ctlr
    +-- tmp
        +-- articles
            +-- 1_raw.txt <- the paper with the ID (from scraper.py run)
            +-- 1_meta.json <- the paper meta-information (from scraper.py and pipeline.py run)
            +-- 1_cleaned.txt <- processed text with no punctuation (by pipeline.py run)
            +-- 1_udpipe_conllu.conllu <- processed text in the UD format (by pipeline.py run)
            +-- 1_stanza_conllu.conllu <- processed text in the UD format (by pipeline.py run, only for mark 10)
            +-- 1_image.png <- POS frequencies bar chart (by pipeline.py run)

Hint

When using CI (Continuous Integration), the generated processed-dataset.zip is available in the build artifacts. Go to the Actions tab in the GitHub UI of your fork, open the latest job, and download the artifact if one is present.

Configuring pipeline

Processing behavior must follow several steps:

  1. Pipeline takes a raw dataset that is collected by scraper.py and placed at ASSETS_PATH (see core_utils/constants.py for the exact path).

  2. Pipeline goes through each raw file, for example 1_raw.txt.

  3. Pipeline processes each text via specified UD model and extracts linguistic information.

  4. Pipeline saves the extracted information for each processed article into a file with the same ID, for example 1_udpipe_conllu.conllu.

Assessment criteria

You state the mark you aim for by editing the settings.json file. For example:

{
    "target_score": 6
}

would mean that you have completed the tasks for mark 6 and request that mentors check whether you can get it.

  1. Desired mark 4:

    1. pylint level: 5/10.

    2. Pipeline validates that the raw dataset has a proper structure and fails appropriately if it does not. Criteria:

      1. Dataset exists (there is a folder).

      2. Dataset is not empty (there are files inside).

      3. Dataset is balanced - there are only files that follow the naming conventions:

        1. N_raw.txt where N is a valid number.

        2. Article numbers run from 1 to N without any gaps.

    3. Pipeline cleans the text in each file by removing punctuation and converting it to lowercase (no lemmatization or tagging).

    4. Pipeline produces N_cleaned.txt files in the tmp/articles.

      1. Example raw text and Desired output.

  2. Desired mark 6:

    1. pylint level: 7/10.

    2. All requirements for the mark 4.

    3. Pipeline uses spacy-udpipe library to extract linguistic markup for the text in the CoNLL-U format.

    4. Pipeline produces N_udpipe_conllu.conllu files for each article.

      1. Example raw text and Desired output.

  3. Desired mark 8:

    1. pylint level: 10/10.

    2. All requirements for the mark 6.

    3. Pipeline collects POS frequencies in each text, extends N_meta.json files with this information and visualizes the distribution via a bar chart saved as an N_image.png file for each article.

      1. Example raw text, Desired CoNLL-U output, Example meta info and Desired bar chart.

  4. Desired mark 10:

    1. pylint level: 10/10.

    2. All requirements for the mark 8.

    3. Pipeline uses stanza library to extract linguistic markup for the text in the CoNLL-U format.

    4. Pipeline produces N_stanza_conllu.conllu files for each article.

    5. Pipeline uses networkx library to create a graph of dependencies to search for a required syntactic pattern.

    6. Pipeline collects required syntactic patterns in each text and extends N_meta.json files with this information.

      1. Example raw text and Example meta info.

Implementation tactics

All logic for instantiating and using needed abstractions should be implemented in a special block of the module pipeline.py:

def main():
    print('Your code goes here')

if __name__ == '__main__':
    main()

Stage 0. Prerequisites

  • scraper.py implementation

    You will not be able to start this implementation if there is no scraper.py implementation. Make sure you have implemented and passed the scraper.py assignment first.

  • Ensure you only use pathlib to work with file paths

    As discussed in the lectures, it is always better to use a tool designed specifically for the given task. Of the os and pathlib modules, the latter is the one designed for most file-system operations. Make sure you use only pathlib in the code you write.

Important

Do not change modules external to your code, for example core_utils/article/article.py; consider them as available only via installation. If you see a way to improve external modules, propose the changes in a separate PR. Mentors will review them separately and give you bonuses, as any improvements are appreciated.

Stage 1. Introduce a corpus abstraction: CorpusManager

As we have discussed multiple times, when our Python programs work with real-world entities, we need to emulate their behavior with new abstractions. If we think of the Pipeline and consider the Single Responsibility Principle, we will quickly realize that it is not the responsibility of the Pipeline to know where the dataset files are located and how to read from or write to them. Therefore, we need a new abstraction to be responsible for such tasks. We call it lab_6_pipeline.pipeline.CorpusManager.

Stage 1.1. Introduce CorpusManager abstraction

The lab_6_pipeline.pipeline.CorpusManager is an entity that knows where the dataset is placed and which files it contains.

It should be instantiated with the following instruction:

corpus_manager = CorpusManager(path_to_raw_txt_data=ASSETS_PATH)

The lab_6_pipeline.pipeline.CorpusManager instance validates the provided dataset and saves all the constructor arguments in attributes with corresponding names. Each instance should also have an additional attribute self._storage of a dictionary type, filled with information about the files. Read about the filling instructions in Stage 1.3.

Note

Remember to use pathlib to create file path object.

Stage 1.2. Implement a method for a dataset validation

The pipeline expects the dataset to be collected by the scraper. It must not start working if the dataset is invalid. The very first thing that should happen after lab_6_pipeline.pipeline.CorpusManager is instantiated is dataset validation. Implement the lab_6_pipeline.pipeline.CorpusManager._validate_dataset() method.

Note

Remember to use pathlib module in order to operate paths.

Note

Call this method during CorpusManager initialization.

When the dataset is valid, the method returns None. Otherwise:

  1. One of the following errors is thrown:

    • FileNotFoundError: file does not exist;

    • NotADirectoryError: path does not lead to directory;

    • InconsistentDatasetError: IDs contain gaps, the numbers of meta and raw files are not equal, or files are empty;

      • For mark 4, check that the dataset contains no gaps in the IDs of raw files and that the files are not empty.

    • EmptyDirectoryError: directory is empty.

  2. Script immediately finishes execution.

Note

While validating the dataset, ignore files whose names do not conform to the format.
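
A minimal sketch of such validation, assuming InconsistentDatasetError and EmptyDirectoryError are importable from core_utils (check the exact import paths in your repository); for mark 4 the meta-file checks may be omitted:

def _validate_dataset(self) -> None:
    """Validate the dataset structure and raise the corresponding error."""
    if not self.path_to_raw_txt_data.exists():
        raise FileNotFoundError
    if not self.path_to_raw_txt_data.is_dir():
        raise NotADirectoryError
    if not any(self.path_to_raw_txt_data.iterdir()):
        raise EmptyDirectoryError

    raw_files = list(self.path_to_raw_txt_data.glob('*_raw.txt'))
    meta_files = list(self.path_to_raw_txt_data.glob('*_meta.json'))
    if len(raw_files) != len(meta_files):
        raise InconsistentDatasetError
    # Files whose names do not follow the N_raw.txt convention are ignored.
    raw_ids = sorted(
        int(file.name.split('_')[0])
        for file in raw_files
        if file.name.split('_')[0].isdigit()
    )
    if raw_ids != list(range(1, len(raw_ids) + 1)):
        raise InconsistentDatasetError
    if any(file.stat().st_size == 0 for file in raw_files + meta_files):
        raise InconsistentDatasetError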

Stage 1.3. Implement a method for filling files storage

During initialization of lab_6_pipeline.pipeline.CorpusManager, it should scan the provided folder path and register each dataset entry. The storage is represented by the self._storage attribute. Filling the storage should be done by executing the lab_6_pipeline.pipeline.CorpusManager._scan_dataset() method.

Note

Call this method during initialization and save the results in self._storage attribute.

Attention

Can you explain why the name of the method starts with an underscore?

The method should contain logic for iterating over the content of the folder, finding all N_raw.txt files and creating core_utils.ctlr.article.article.Article instance for each file.

Note

The core_utils.ctlr.article.article.Article constructor expects a URL as the first argument. It should be safe to pass None instead of the real URL. The pipeline does not need to know where the article was downloaded from. See the Article package.

As stated before, the self._storage attribute is just a dictionary. Keys are the IDs of the files; values are instances of the core_utils.ctlr.article.article.Article class. For example, the pipeline finds a file 1_raw.txt. Then we put a new pair into the storage:

self._storage[1] = Article(url=None, article_id=1)

Note

The lab_6_pipeline.pipeline.CorpusManager knows where the files are and can easily find them by ID, but it is not its responsibility to perform actual file reads and writes. See the core_utils/article/io.py module for article save/read functionality.
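
A possible sketch of the scanning logic, following the storage convention shown above (names that do not match the N_raw.txt pattern are simply skipped):

def _scan_dataset(self) -> None:
    """Register each N_raw.txt file as an Article instance keyed by its ID."""
    for file in self.path_to_raw_txt_data.glob('*_raw.txt'):
        if not file.name.split('_')[0].isdigit():
            continue
        article_id = int(file.name.split('_')[0])
        self._storage[article_id] = Article(url=None, article_id=article_id)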

Stage 1.4. Implement a method for retrieval of files storage

The self._storage attribute is not a part of lab_6_pipeline.pipeline.CorpusManager interface, therefore we need a special getter - a method that just returns a storage value. At this stage, you need to implement lab_6_pipeline.pipeline.CorpusManager.get_articles() method.

Attention

Can you explain why we might need getters?

Eventually, lab_6_pipeline.pipeline.CorpusManager should return a dictionary of core_utils.ctlr.article.article.Article instances via lab_6_pipeline.pipeline.CorpusManager.get_articles() method.
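
The getter itself stays trivial; a sketch:

def get_articles(self) -> dict:
    """Return the storage of article instances."""
    return self._storage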

Stage 2. Introduce abstraction for processing texts: TextProcessingPipeline

To get a mark not lower than 4, your pipeline must perform basic text preprocessing:

  1. Lowercase text.

  2. Remove punctuation.

After implementation of preprocessing, your pipeline must save results in the files with the names following the pattern N_cleaned.txt. See examples for a better understanding: Raw text - Desired output.
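
One possible way to clean a single text, shown as an illustrative sketch only (note that string.punctuation covers ASCII characters, so you may need to extend it for Russian quotation marks and dashes):

import string


def clean_text(raw_text: str) -> str:
    """Lowercase the text and strip punctuation characters."""
    return ''.join(
        symbol for symbol in raw_text.lower() if symbol not in string.punctuation
    )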

Stage 2.1. Implement simplified logic of TextProcessingPipeline

All of the above stages are necessary for implementing a simplified lab_6_pipeline.pipeline.TextProcessingPipeline abstraction. It takes the raw text of the article and saves the processed (lowercased, with no punctuation) text to a file N_cleaned.txt. The abstraction should have a self._corpus attribute which represents your lab_6_pipeline.pipeline.CorpusManager abstraction.

It should be instantiated with the following instruction:

pipeline = TextProcessingPipeline(corpus_manager)

It is executed with a simple interface method lab_6_pipeline.pipeline.TextProcessingPipeline.run() that you need to implement. Once executed, lab_6_pipeline.pipeline.TextProcessingPipeline.run() iterates through the available articles taken from lab_6_pipeline.pipeline.CorpusManager, performs basic preprocessing and writes processed text to files.

Note

It is mandatory to get articles with the lab_6_pipeline.pipeline.CorpusManager.get_articles() method.

Stage 2.2. Save the results of text preprocessing

Important

Stages 0-2.2 are required to get the mark 4.

It is mandatory to save the generated text to a file in the lab_6_pipeline.pipeline.TextProcessingPipeline.run() method. In order to do this, inspect the core_utils/article/io.py module. Use the core_utils.ctlr.article.io.to_cleaned() function to save cleaned text to the appropriate folder. Call this function with the article instance you want to save text for.

The function generates a file with a name N_cleaned.txt, where N is the index of your article in the tmp/articles directory.
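
A minimal sketch of the mark 4 processing loop; the exact signatures of from_raw() and to_cleaned() should be checked in core_utils/article/io.py, the import paths may differ in your repository, and where the cleaning itself happens (inside the Article abstraction or in the pipeline) depends on that implementation:

from core_utils.constants import ASSETS_PATH
from core_utils.ctlr.article.io import from_raw, to_cleaned


def run(self) -> None:
    """Clean each raw article and save it as N_cleaned.txt."""
    for article_id, article in self._corpus.get_articles().items():
        # from_raw is assumed to read N_raw.txt and return a filled Article.
        article = from_raw(ASSETS_PATH / f'{article_id}_raw.txt', article)
        to_cleaned(article)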

Stage 3. Extract linguistic markup using UDPipe model

To get a mark not lower than 6, your pipeline, in addition to mark 4 requirements, must perform morphological text analysis for each article using spacy-udpipe library and save the result in the file with the name following the pattern N_udpipe_conllu.conllu. See examples for a better understanding: Raw text - Desired output.

A file with the .conllu extension corresponds to the UD format. Starting with mark 6, you are required to save the results of morphological text analysis in the UD format. For a better understanding of the format fields, refer to the dedicated page.

As all storing and managing of article text information is done by the core_utils.ctlr.article.article.Article abstraction, see the Article package before proceeding to the next stages.

Stage 3.1.1 Download model from GitHub releases

You are required to download the UDPipe model from HERE and place it in lab_6_pipeline/assets/model.

Stage 3.1.2 Introduce UDPipeAnalyzer abstraction

Given that the present laboratory work implements text processing via more than one linguistic model (UDPipe and Stanza), there is a need for a unified interface. For this reason, you are required to implement a wrapper abstraction over the UDPipe model: lab_6_pipeline.pipeline.UDPipeAnalyzer. This abstraction is responsible for processing text and outputting its linguistic features in the CoNLL-U format.

Notice that this class inherits from LibraryWrapper, which defines a specific set of methods to be present across all wrappers: _bootstrap, analyze, to_conllu, from_conllu and get_document, as well as self._analyzer attribute. In the following sections each field will be explained.

First, the wrapper should be instantiated with the following instruction:

udpipe_analyzer = UDPipeAnalyzer()

The wrapper does not accept any arguments during initialization but calls the protected method lab_6_pipeline.pipeline.UDPipeAnalyzer._bootstrap(), which is responsible for loading and setting up the UDPipe model.

Hint

Path to the required UDPipe model is stored in core_utils/constants.py module as UDPIPE_MODEL_PATH.

The lab_6_pipeline.pipeline.UDPipeAnalyzer._bootstrap() method must read the UDPipe model via spacy_udpipe library, add CoNLL-U formatter to the model pipeline and define specific configurations for the model.

Tip

Refer to the corresponding seminar materials or inspect the official repository of the library to learn more about appropriate configuration details.

Finally, lab_6_pipeline.pipeline.UDPipeAnalyzer._bootstrap() returns the resulting model, which is further stored in the protected self._analyzer attribute of lab_6_pipeline.pipeline.UDPipeAnalyzer instance.

Note

Naturally, methods of the spacy-udpipe module return an instance of its own abstraction, Language, not an instance of AbstractCoNLLUAnalyzer, as specified in the typing annotation of lab_6_pipeline.pipeline.UDPipeAnalyzer._bootstrap(). However, given that the laboratory work covers more than one language analyzer, it is necessary to unite all the different types of analyzer instances used. For this exact reason, the AbstractCoNLLUAnalyzer class is defined: it does not impose a special interface, as the protocols of the pipeline and library wrappers do, but simply indicates that this object is responsible for analyzing the language material. As for the interface of the analyzer object, it is the responsibility of the corresponding library wrapper to handle it.
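
A sketch of what _bootstrap() may look like; the concrete conll_formatter configuration is an assumption and should be checked against the seminar materials and the spacy_conll documentation:

import spacy_udpipe
import spacy_conll  # noqa: F401  # registers the "conll_formatter" factory

from core_utils.constants import UDPIPE_MODEL_PATH


def _bootstrap(self) -> AbstractCoNLLUAnalyzer:
    """Load the local UDPipe model and attach the CoNLL-U formatter."""
    model = spacy_udpipe.load_from_path(lang='ru', path=str(UDPIPE_MODEL_PATH))
    model.add_pipe(
        'conll_formatter',
        last=True,
        config={'include_headers': True},  # assumed configuration
    )
    return model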

Stage 3.2. Process text via UDPipeAnalyzer abstraction

Next, you are required to implement lab_6_pipeline.pipeline.UDPipeAnalyzer.analyze() method. It is a public method used to process texts into CoNLL-U formatted markup. The method accepts a list of strings and produces a list of strings.

This method uses self._analyzer attribute, which encloses the UDPipe model, to retrieve linguistic features of the text in a required format.

Tip

To learn more about the UDPipe model interface, refer to the corresponding seminar materials or inspect the official repository of the library.
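
With the conll_formatter attached, the markup becomes available through the custom doc._.conll_str attribute, so a sketch of analyze() can be as short as:

def analyze(self, texts: list[str]) -> list[str]:
    """Produce CoNLL-U formatted markup for each text."""
    return [self._analyzer(text)._.conll_str for text in texts]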

Stage 3.3. Save linguistic markup via UDPipeAnalyzer abstraction

Finally, the lab_6_pipeline.pipeline.UDPipeAnalyzer abstraction must possess a method for producing a file with the .conllu extension containing the retrieved linguistic markup. The lab_6_pipeline.pipeline.UDPipeAnalyzer.to_conllu() method does not perform any analysis but operates on the fields of a core_utils.ctlr.article.article.Article instance.

The method accepts one instance of core_utils.ctlr.article.article.Article as an argument. It is presumed that the given article object has a filled attribute with CoNLL-U markup. The method thus uses the interface of the core_utils.ctlr.article.article.Article instance to save the stored information into the N_udpipe_conllu.conllu file, where N corresponds to the identifier of the article.

Stage 3.4. Extend TextProcessingPipeline with morphological analysis logic

Important

Stages 0-3.4 are required to get the mark 6.

After implementing the abstraction for linguistic data retrieval, you need to define the overall processing logic to fill article instances with markup and save the processing result in the UD format. All processing and filling actions are the responsibility of the pipeline. So you need to extend lab_6_pipeline.pipeline.TextProcessingPipeline.

For mark 6, apart from punctuation removal and casting to lowercase, you must implement the extraction of all information required by UD format and save it in the corresponding .conllu files. In other words, the execution of the lab_6_pipeline.pipeline.TextProcessingPipeline.run() method must result in producing both N_cleaned.txt and N_udpipe_conllu.conllu files for each available article.

In order to achieve that, first, make sure that during the instantiation of the pipeline an instance of lab_6_pipeline.pipeline.UDPipeAnalyzer model is accepted as an argument and saved to the _analyzer protected attribute.

Then, during the execution of the lab_6_pipeline.pipeline.TextProcessingPipeline.run() method, apart from cleaning each article, perform the following:

  1. Extract CoNLL-U formatted markup of the text via the analyzer.

  2. Store the extracted markup in a corresponding field of article instance.

  3. Save the stored markup to .conllu file via analyzer interface.

Note

It is mandatory to use core_utils.ctlr.article.io.from_raw(), lab_6_pipeline.pipeline.UDPipeAnalyzer.analyze(), core_utils.ctlr.article.article.Article.set_conllu_info() and lab_6_pipeline.pipeline.UDPipeAnalyzer.to_conllu() methods.
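
A sketch of the extended run() method built around the methods listed in the note above, continuing the Stage 2.2 sketch; the article.text attribute and the path construction are assumptions to be verified against core_utils:

def run(self) -> None:
    """Clean each article and, if an analyzer is set, save its CoNLL-U markup."""
    for article_id, article in self._corpus.get_articles().items():
        article = from_raw(ASSETS_PATH / f'{article_id}_raw.txt', article)
        to_cleaned(article)
        if self._analyzer is None:
            continue  # mark 4 behaviour: only N_cleaned.txt is produced
        # analyze() accepts a list of texts and returns a list of CoNLL-U strings.
        conllu_markup = self._analyzer.analyze([article.text])[0]
        article.set_conllu_info(conllu_markup)
        self._analyzer.to_conllu(article)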

Stage 4. Extract and visualize POS frequency statistics

We have just finished the implementation of the text processing pipeline. However, this is just the beginning of your linguistic research: you have the data and now it is time to start analyzing it, gaining insights, achieving a better understanding and finding hidden meanings. During this stage we will make a pipeline that computes the distribution of various parts of speech in our texts, visualizes it and, perhaps, provides us with a better understanding of the text.

This is a sample result we are going to obtain:

sample_visualization.png

Stage 4.1 Extend UDPipeAnalyzer with CoNLL-U parsing functionality

In order to process information stored in the .conllu files, you are required to implement method lab_6_pipeline.pipeline.UDPipeAnalyzer.from_conllu().

Its responsibility is the opposite of that of the lab_6_pipeline.pipeline.UDPipeAnalyzer.to_conllu() method: it accepts an article instance, derives the name of the file where its UD properties are stored, and converts the contents into a UDPipeDocument via spacy_conll.parser.

Note

Note that UDPipeDocument, similarly to AbstractCoNLLUAnalyzer, is a protocol that is used to mimic the UDPipe document class of the respective library and help unify the interfaces for further usage with different libraries.
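
A sketch of from_conllu() based on spacy_conll.parser.ConllParser; the path construction and the article_id attribute are assumptions, as the Article interface may already provide a helper for this path:

from spacy_conll.parser import ConllParser


def from_conllu(self, article: Article) -> UDPipeDocument:
    """Load the stored CoNLL-U markup of the article as a document."""
    path = ASSETS_PATH / f'{article.article_id}_udpipe_conllu.conllu'
    parser = ConllParser(self._analyzer)
    return parser.parse_conll_text_as_spacy(path.read_text(encoding='utf-8').strip())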

Stage 4.2. Introduce POSFrequencyPipeline abstraction

Now we are going to work with the lab_6_pipeline.pipeline.POSFrequencyPipeline class. The lab_6_pipeline.pipeline.POSFrequencyPipeline is instantiated in a similar manner to the lab_6_pipeline.pipeline.TextProcessingPipeline. During initialization it must accept an instance of CorpusManager and an instance of an analyzer. For mark 8 it is UDPipeAnalyzer, and for mark 10 it could also be StanzaAnalyzer, which is described at later stages.

corpus_manager = CorpusManager(...)
analyzer = UDPipeAnalyzer()
visualizer = POSFrequencyPipeline(corpus_manager, analyzer)

During instantiation, the provided instances of the CorpusManager and analyzer classes are stored in the self._corpus and self._analyzer attributes.

Stage 4.3. Implement core logic of POSFrequencyPipeline

Important

Stages 0-4.3 are required to get the mark 8.

The lab_6_pipeline.pipeline.POSFrequencyPipeline is executed with the same interface method lab_6_pipeline.pipeline.POSFrequencyPipeline.run() that you need to implement.

Once executed, lab_6_pipeline.pipeline.POSFrequencyPipeline.run():

  1. Iterates through the available articles taken from lab_6_pipeline.pipeline.CorpusManager.

  2. Retrieves UD information for each article via analyzer interface.

  3. Calculates the frequencies of each part of speech via the protected method lab_6_pipeline.pipeline.POSFrequencyPipeline._count_frequencies(), which accepts an article instance and returns a dictionary in the format {<POS>: <number of occurrences>} (a sketch is given after the note below).

  4. Writes them to the meta file via Article instance interface.

  5. Visualizes the frequencies in the form of images with names following the N_image.png convention.

Note

It is mandatory to get articles with the lab_6_pipeline.pipeline.CorpusManager.get_articles() method and to get UD information with the analyzer's corresponding core_utils.ctlr.pipeline.LibraryWrapper.from_conllu() method.
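
A sketch of _count_frequencies(); it assumes the document returned by from_conllu() exposes universal POS tags the way spaCy tokens do (via pos_), so adjust it to the actual protocol interface:

from collections import Counter


def _count_frequencies(self, article: Article) -> dict[str, int]:
    """Count how many times each universal POS tag occurs in the article."""
    document = self._analyzer.from_conllu(article)
    return dict(Counter(token.pos_ for token in document))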

Attention

You have to create an EmptyFileError exception class and raise it when an article file is empty (you can add the checks directly to the core_utils.ctlr.pipeline.LibraryWrapper.from_conllu() methods).

Attention

Make sure that resulting meta files are valid: they must contain no more than one dictionary-like object.

For visualization, you need to use core_utils.ctlr.visualizer.visualize() function.

Sample usage:

visualize(article=article, path_to_save=ASSETS_PATH / '1_image.png')

Stage 5. Extract syntactic patterns using graphs

For a mark higher than 8, you are required to search for a certain pattern using information from .conllu files and the functionality of the networkx library.

See examples for a better understanding: Raw text - Desired output.

Tip

To learn more about networkx, inspect the library's official documentation.

Stage 5.1. Introduce PatternSearchPipeline abstraction

Now we are going to work with the lab_6_pipeline.pipeline.PatternSearchPipeline class. During instantiation it must accept an instance of CorpusManager, an instance of an analyzer and a tuple of POS tags for the required syntactic pattern.

In this laboratory work we are interested in exploring “verb-preposition-noun” dependency patterns, which is why the instantiation looks as follows:

corpus_manager = CorpusManager(...)
analyzer = UDPipeAnalyzer()
pattern_searcher = PatternSearchPipeline(corpus_manager, analyzer, ("VERB", "NOUN", "ADP"))

During instantiation, the provided instances of the CorpusManager and analyzer classes are stored in the self._corpus and self._analyzer attributes. The tuple of POS tags is stored in the self._node_labels attribute.

Stage 5.2. Make syntactic graphs via PatternSearchPipeline abstraction

In this laboratory work we are going to search for patterns using graphs of syntactic dependencies. This approach facilitates effortless scaling of pattern sizes, enabling the discovery of syntactic patterns of varying lengths and widths.

Note

The idea of the graph of syntactic dependencies is based on the Dependency grammar. It represents the structure of a sentence in the form of a hierarchy of components between which dependency relationships are established. Thus, sentence structure is considered in terms of vertices (roots) and dependents (children).

Before looking for patterns, we have to create a syntactic graph for each sentence in the article, which we will subsequently search.

Example of the graph for the sentence: Я учусь в университете ('I study at the university').

sentence graph sample

In order to make a graph of syntactic dependencies for each sentence in the article, you are required to implement a lab_6_pipeline.pipeline.PatternSearchPipeline._make_graphs() method.

The method accepts one instance of core_utils.ctlr.pipeline.CoNLLUDocument as an argument. It is presumed that the given document object contains information from the .conllu file, obtained using the core_utils.ctlr.pipeline.LibraryWrapper.from_conllu() method.

The lab_6_pipeline.pipeline.PatternSearchPipeline._make_graphs() method iterates through each sentence in the article and creates nodes and edges for the words in the sentence. Pass upos as the label argument when creating a node and deprel as the label argument when creating an edge. Edges should connect each word with its parent using the dependency relation (see the sketch below).

Tip

To make a graph it is mandatory to use an instance of DiGraph class.

Note

core_utils.ctlr.pipeline.CoNLLUDocument is a protocol class that mimics and inherits protocols for the analyzers' document classes. It is used to unify the interfaces of analyzers and simplify the typing across the classes based on the same protocols. You can notice it, for example, in the interface difference between lab_6_pipeline.pipeline.UDPipeAnalyzer.analyze(), lab_6_pipeline.pipeline.StanzaAnalyzer.analyze() and their mutual protocol interface core_utils.ctlr.pipeline.LibraryWrapper.analyze(). It makes for a clearer distinction in what is allowed in which exact case without using disjunctions in typing.
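
A sketch of _make_graphs(); the attribute names (sentences, words, id, head, upos, deprel) mirror the CoNLL-U fields and should be checked against the actual protocol interface:

import networkx as nx


def _make_graphs(self, doc: CoNLLUDocument) -> list[nx.DiGraph]:
    """Build a dependency graph for every sentence of the document."""
    graphs = []
    for sentence in doc.sentences:
        graph = nx.DiGraph()
        for word in sentence.words:
            graph.add_node(word.id, label=word.upos)
            if word.head:  # head == 0 marks the sentence root
                graph.add_edge(word.head, word.id, label=word.deprel)
        graphs.append(graph)
    return graphs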

Stage 5.3. Find syntactic patterns via PatternSearchPipeline abstraction

After creating graphs of syntactic dependencies for each sentence in the article, we are ready to search for the required pattern in it.

The key idea is the following: we have the dependency graph of our sentence and we want to make an ideal graph for our pattern.

Note

Your implementation of creating an ideal graph should be scalable, which means that it is not tied to a specific syntactic pattern.

Example of the ideal graph for our syntactic pattern:

ideal graph sample

After that, we search for subgraphs of our graph that are isomorphic to the ideal graph representing our syntactic pattern.

Note

Isomorphic graphs are graphs that have the same structure, meaning they have the same number of vertices (points) and edges (connections between points), and these connections are arranged in the same way. The simple example is two different maps of the same city that show the same streets and intersections but might label them differently.

To find such subgraphs, we instantiate the GraphMatcher class. During instantiation it takes the sentence graph and the ideal graph, as well as an anonymous function passed as the node_match argument with the logic for comparing the labels of these graphs. This way we can find all subgraphs that match the required POS tags of the syntactic pattern.
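
A sketch of this matching step; since the graphs are DiGraph instances, the directed variant DiGraphMatcher from networkx is used here, and ideal_graph stands for the hand-built pattern graph:

from networkx.algorithms import isomorphism

# sentence_graph is one of the graphs built in _make_graphs();
# ideal_graph encodes the required pattern with POS labels on its nodes.
matcher = isomorphism.DiGraphMatcher(
    sentence_graph,
    ideal_graph,
    node_match=lambda candidate, pattern: candidate.get('label') == pattern.get('label'),
)
# Each mapping is a dict from sentence-graph node IDs to ideal-graph node IDs.
matches = list(matcher.subgraph_isomorphisms_iter())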

However, that is not everything. Graphs are just a convenient way to search for a certain pattern. If we want to further analyze information about the found patterns or calculate some statistics, we need to use a more traditional way of storing information.

For these purposes there is the core_utils.ctlr.pipeline.TreeNode class. It stores information about a node of the tree. You have to instantiate it with the POS tag, the text and the list of dependent children of your node.

To add all children to this list you have to implement the lab_6_pipeline.pipeline.PatternSearchPipeline._add_children() method. It accepts the current graph (an instance of the DiGraph class), a dictionary with matched subgraphs, the ID of the root node and the root node of the matched subgraph. It iterates through the available children of the accepted root node, instantiates the core_utils.ctlr.pipeline.TreeNode class and appends information about the new children to the root node.

The method works recursively until all the children of the current root node are found.

Note

Recursion is a method where a function calls itself to solve a problem. It’s like a loop that repeats an action, but instead of using a traditional loop structure, the function keeps calling itself with a modified parameter until it reaches a condition that tells it to stop, known as the base case.

# A classic example of recursion is calculating the factorial of a number

def calculate_factorial(number):
    if number <= 1: # Base case: 0! and 1! are both 1
        return 1
    else:
        return number * calculate_factorial(number - 1) # Recursive case

The lab_6_pipeline.pipeline.PatternSearchPipeline._find_pattern() method returns a dictionary whose keys are the indexes of the sentences where the required pattern was found and whose values are lists of core_utils.ctlr.pipeline.TreeNode instances with information about the pattern matches.

Stage 5.4. Implement core logic of PatternSearchPipeline

The lab_6_pipeline.pipeline.PatternSearchPipeline is executed with the same interface method lab_6_pipeline.pipeline.PatternSearchPipeline.run() that you need to implement.

Once executed, lab_6_pipeline.pipeline.PatternSearchPipeline.run():

  1. Iterates through the available articles taken from lab_6_pipeline.pipeline.CorpusManager.

  2. Retrieves UD information for each article via analyzer interface.

  3. Makes graphs of syntactic dependencies for each article via protected method lab_6_pipeline.pipeline.PatternSearchPipeline._make_graphs().

  4. Searches for the required pattern for each article via protected method lab_6_pipeline.pipeline.PatternSearchPipeline._find_pattern().

  5. Writes them to the meta file via Article instance interface.

Note

It is mandatory to get articles with the lab_6_pipeline.pipeline.CorpusManager.get_articles() method and to get UD information with the analyzer's corresponding core_utils.ctlr.pipeline.LibraryWrapper.from_conllu() method.

Attention

Make sure that resulting meta files are valid: they must contain no more than one dictionary-like object.

Stage 6. Extract linguistic markup using Stanza model

There are other models that can be used for the extraction of linguistic features in the CoNLL-U format and that you might encounter in your linguistic research. It might be valuable to understand the differences in the performance of the models and be familiar with their distinctive features.

By now, you have already implemented all text and linguistic feature processing with a UDPipe model. As an additional challenge for mark 10, you are required to perform processing using the Stanza model as well.

See examples for a better understanding of Stanza model output: Raw text - Desired output.

Tip

Refer to the corresponding seminar materials or inspect the official repository of the library to learn more about Stanza interface details.

Stage 6.1. Introduce StanzaAnalyzer abstraction

Implement lab_6_pipeline.pipeline.StanzaAnalyzer abstraction. It is a wrapper over a Stanza model. Similarly, its responsibility is processing text and outputting its linguistic features in CoNLL-U format.

Notice that this wrapper inherits from LibraryWrapper protocol, which means that its interface is dictated by the protocol and identical to that of lab_6_pipeline.pipeline.UDPipeAnalyzer abstraction. In other words, lab_6_pipeline.pipeline.StanzaAnalyzer and lab_6_pipeline.pipeline.UDPipeAnalyzer have the same function and interface, meaning that they can be used interchangeably, but their inner workings are different as they are based on separate models. Thus, the resulting processing may also vary.

The lab_6_pipeline.pipeline.StanzaAnalyzer wrapper should be instantiated with the following instruction:

stanza_analyzer = StanzaAnalyzer()

The wrapper does not accept any arguments during initialization but calls the protected method lab_6_pipeline.pipeline.StanzaAnalyzer._bootstrap(), which is responsible for downloading and initializing the Stanza model.

The lab_6_pipeline.pipeline.StanzaAnalyzer._bootstrap() method must download, via the stanza library, a model for the Russian language with the following functionality:

  • tokenization

  • lemmatization

  • part of speech extraction

  • dependency parsing

The method then initializes an instance of Stanza model via stanza.pipeline.core module functionality and returns it. The model returned is further stored in the protected self._analyzer attribute of lab_6_pipeline.pipeline.StanzaAnalyzer instance.
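
A sketch of the Stanza _bootstrap() method; the processors string matches the functionality listed above, while any further configuration should follow the seminar materials:

import stanza
from stanza.pipeline.core import Pipeline


def _bootstrap(self) -> AbstractCoNLLUAnalyzer:
    """Download the Russian Stanza model and build a pipeline from it."""
    processors = 'tokenize,pos,lemma,depparse'
    stanza.download(lang='ru', processors=processors)
    return Pipeline(lang='ru', processors=processors)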

Tip

Refer to the corresponding seminar materials or inspect the official repository of the library to learn more about Stanza interface details.

Stage 6.2. Process text via StanzaAnalyzer abstraction

Next, you are required to implement lab_6_pipeline.pipeline.StanzaAnalyzer.analyze() method. It is a public method used to process texts into CoNLL-U formatted markup. The method accepts a list of strings and produces a list of StanzaDocument instances.

Note

Note that StanzaDocument, similarly to UDPipeDocument, is a protocol that is used to mimic the Stanza document class of the stanza library and help unify the interfaces for further usage with different libraries.

This method uses self._analyzer attribute, which encloses the Stanza model, to retrieve linguistic features of the text.
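
A sketch of analyze(); calling the Stanza pipeline on a string returns a document, which is what the StanzaDocument protocol stands for:

def analyze(self, texts: list[str]) -> list[StanzaDocument]:
    """Produce a Stanza document for each text."""
    return [self._analyzer(text) for text in texts]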

Tip

To learn more about the Stanza model interface, refer to the corresponding seminar materials or inspect the official repository of the library.

Stage 6.3. Save linguistic markup via StanzaAnalyzer abstraction

Finally, the lab_6_pipeline.pipeline.StanzaAnalyzer abstraction must possess a method for producing a file with the .conllu extension containing the retrieved linguistic markup. Implement the method lab_6_pipeline.pipeline.StanzaAnalyzer.to_conllu(), which does not perform any analysis but operates on the fields of a core_utils.ctlr.article.article.Article instance along with functions from stanza.utils.conll.

Tip

To learn more about the Stanza library interface, refer to the corresponding seminar materials or inspect the official repository of the library.

The method accepts one instance of core_utils.ctlr.article.article.Article as an argument. It is presumed that the given article object has a filled attribute with CoNLL-U markup. The method thus uses the interface of the core_utils.ctlr.article.article.Article instance to save the stored UD information into the N_stanza_conllu.conllu file, where N corresponds to the identifier of the article.
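
A sketch of to_conllu() built on stanza.utils.conll; the path construction and the get_conllu_info() getter are assumptions about the Article interface and should be verified:

from stanza.utils.conll import CoNLL


def to_conllu(self, article: Article) -> None:
    """Save the article's CoNLL-U markup to N_stanza_conllu.conllu."""
    path = ASSETS_PATH / f'{article.article_id}_stanza_conllu.conllu'
    CoNLL.write_doc2conll(article.get_conllu_info(), str(path))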

Stage 6.4. Ensure compatibility of TextProcessingPipeline with StanzaAnalyzer

Since both lab_6_pipeline.pipeline.StanzaAnalyzer and lab_6_pipeline.pipeline.UDPipeAnalyzer derive from the same interface protocol and carry the same functionality, there is no need to adjust the lab_6_pipeline.pipeline.TextProcessingPipeline logic to work with the Stanza wrapper. Perform a self-check by ensuring that your code works according to the following examples.

This code snippet must produce N_cleaned.txt and N_stanza_conllu.conllu files for each available article:

corpus_manager = CorpusManager(path_to_raw_txt_data=ASSETS_PATH)
stanza_analyzer = StanzaAnalyzer()
pipeline = TextProcessingPipeline(corpus_manager, stanza_analyzer)
pipeline.run()

This code snippet must produce N_cleaned.txt and N_udpipe_conllu.conllu files for each available article:

corpus_manager = CorpusManager(path_to_raw_txt_data=ASSETS_PATH)
udpipe_analyzer = UDPipeAnalyzer()
pipeline = TextProcessingPipeline(corpus_manager, udpipe_analyzer)
pipeline.run()

Finally, this code snippet must produce just the N_cleaned.txt files for each available article, with no .conllu files:

corpus_manager = CorpusManager(path_to_raw_txt_data=ASSETS_PATH)
pipeline = TextProcessingPipeline(corpus_manager)
pipeline.run()

If you encounter errors from using lab_6_pipeline.pipeline.TextProcessingPipeline with lab_6_pipeline.pipeline.StanzaAnalyzer, then you must have made a mistake either in the implementation of the model wrapper or in the markup extraction during pipeline execution.

Note

All model wrappers must have an identical interface, specified by LibraryWrapper, and the processing pipeline must rely only on that interface. The pipeline must not rely on any library-specific attributes or methods.

Stage 6.5. Extend StanzaAnalyzer with CoNLL-U parsing functionality

Once again, in order to process information stored in the .conllu files, you are required to implement the method lab_6_pipeline.pipeline.StanzaAnalyzer.from_conllu(), which shares its responsibility with lab_6_pipeline.pipeline.UDPipeAnalyzer.from_conllu() but is used to convert the contents of .conllu files to a StanzaDocument via stanza.utils.conll functions.
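
A sketch of the Stanza variant of from_conllu(); the path construction mirrors the to_conllu() sketch above and remains an assumption:

from stanza.utils.conll import CoNLL


def from_conllu(self, article: Article) -> StanzaDocument:
    """Load the stored CoNLL-U markup of the article as a Stanza document."""
    path = ASSETS_PATH / f'{article.article_id}_stanza_conllu.conllu'
    return CoNLL.conll2doc(input_file=str(path))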

Tip

To learn more about the Stanza library interface, refer to the corresponding seminar materials or inspect the official repository of the library.

Stage 6.6. Extend UDPipeAnalyzer and StanzaAnalyzer with ability to return unified documents

Now that you are familiar with both libraries and their respective document formats (UDPipeDocument and StanzaDocument), you might have discovered that they have slightly different interfaces: generally their functions are the same, but the names of their attributes differ. This is disadvantageous when we want to be able to work comfortably with the contents of any .conllu file. It would take more time and resources to significantly alter the Pipelines to work with both types of analyzers (imagine if we introduce yet another analyzer and have to add one more condition to all the places where we are using document interfaces).

As a solution to this problem, you are required to implement the methods lab_6_pipeline.pipeline.UDPipeAnalyzer.get_document() and lab_6_pipeline.pipeline.StanzaAnalyzer.get_document() that are given documents of the respective origin (UDPipeDocument and StanzaDocument) and produce a unified document, an instance of core_utils.ctlr.pipeline.UnifiedCoNLLUDocument, which contains the necessary information from the document: sentences as instances of core_utils.ctlr.pipeline.ConLLUSentence, and each word in a sentence as an instance of core_utils.ctlr.pipeline.ConLLUWord with its linguistic properties saved in its attributes.

Stage 6.7. Ensure compatibility of POSFrequencyPipeline with StanzaAnalyzer

Like TextProcessingPipeline, POSFrequencyPipeline should be able to accept instances of both analyzer classes. With the addition of the core_utils.ctlr.pipeline.LibraryWrapper.get_document() methods, you still have to partially rework the implementation of lab_6_pipeline.pipeline.POSFrequencyPipeline._count_frequencies() for the POSFrequencyPipeline to work successfully with both models, but this change will not be as global as adding the ability to work with different analyzers to the Pipeline itself (for example, by using conditions).

Note

Adding such an ability to the Pipeline would also not comply with the single responsibility principle of OOP: the Pipeline does not have to check what type of analyzer it uses.

Note

From Stage 6.7 onwards, it is required to use the analyzer method core_utils.ctlr.pipeline.LibraryWrapper.get_document() to get frequencies from the unified document.

Check that the result of running both of the following snippets produces N_image.png and extends N_meta.json with POS frequencies information.

With UDPipeAnalyzer usage:

corpus_manager = CorpusManager(path_to_raw_txt_data=ASSETS_PATH)
udpipe_analyzer = UDPipeAnalyzer()
visualizer = POSFrequencyPipeline(corpus_manager, udpipe_analyzer)
visualizer.run()

With StanzaAnalyzer usage:

corpus_manager = CorpusManager(path_to_raw_txt_data=ASSETS_PATH)
stanza_analyzer = StanzaAnalyzer()
visualizer = POSFrequencyPipeline(corpus_manager, stanza_analyzer)
visualizer.run()

Stage 6.8. Ensure compatibility of PatternSearchPipeline with StanzaAnalyzer

Important

Stages 0-6.8 are required to get the mark 10.

Finally, you have to rework the implementation of lab_6_pipeline.pipeline.PatternSearchPipeline._make_graphs() method to be compatible with StanzaAnalyzer.

Note

For Stage 6.8, it is mandatory to use the analyzer method core_utils.ctlr.pipeline.LibraryWrapper.get_document() to create graphs.

Don’t forget to check that running both of the following snippets extends N_meta.json with information about the found syntactic patterns.

With UDPipeAnalyzer usage:

corpus_manager = CorpusManager(path_to_raw_txt_data=ASSETS_PATH)
udpipe_analyzer = UDPipeAnalyzer()
pattern_searcher = PatternSearchPipeline(
     corpus_manager, udpipe_analyzer, ("VERB", "NOUN", "ADP")
)
pattern_searcher.run()

With StanzaAnalyzer usage:

corpus_manager = CorpusManager(path_to_raw_txt_data=ASSETS_PATH)
stanza_analyzer = StanzaAnalyzer()
pattern_searcher = PatternSearchPipeline(
     corpus_manager, stanza_analyzer, ("VERB", "NOUN", "ADP")
)
pattern_searcher.run()