General Information

This technical track aims to build basic skills for retrieving data from external WWW resources and processing it for future linguistic research. The idea is to automatically obtain a dataset that has a certain structure and appropriate content, and then to perform morphological analysis on it using various natural language processing (NLP) libraries. See Dataset requirements.

Instructors:

Klimova Margarita Andreevna - linguistic track lecturer
Lyashevskaya Olga Nikolaevna - linguistic track lecturer
Demidovskij Alexander Vladimirovich - technical track lecturer
Uraev Dmitry Yurievich - technical track practice lecturer
Zharikov Egor Igorevich - technical track expert
Nurtdinova Sofia Alekseevna - technical track assistant
Podpryatova Anna Sergeevna - technical track assistant
Klimov Andrey Petrovich - technical track assistant
Evgrafova Anna Sergeevna - technical track assistant

Project Timeline

Scraper:
1. Short summary: Your code can automatically parse a media website of your choice and save the texts and their metadata in a proper format.
2. Deadline: May 11.
3. Format: each student works in their own PR.
4. Dataset volume: 100 articles.
5. Design document: Laboratory work №5. Retrieve raw data from World Wide Web.
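The "save texts and metadata" part of the scraper can be sketched with `pathlib` and `json` alone. This is a minimal illustration, not the reference implementation: the function name `save_article`, the file naming scheme (`<id>_raw.txt`, `<id>_meta.json`), and the metadata fields are assumptions made for this example, so check the design document for the required format.

```python
import json
import tempfile
from pathlib import Path


def save_article(base_dir: Path, article_id: int, text: str, meta: dict) -> tuple[Path, Path]:
    """Write the raw text and its metadata side by side (hypothetical helper)."""
    base_dir.mkdir(parents=True, exist_ok=True)
    raw_path = base_dir / f"{article_id}_raw.txt"     # assumed naming scheme
    meta_path = base_dir / f"{article_id}_meta.json"  # assumed naming scheme
    raw_path.write_text(text, encoding="utf-8")
    # ensure_ascii=False keeps non-ASCII titles human-readable in the file
    meta_path.write_text(
        json.dumps(meta, ensure_ascii=False, indent=4), encoding="utf-8"
    )
    return raw_path, meta_path


# Usage: a temporary directory stands in for the real dataset folder.
with tempfile.TemporaryDirectory() as tmp:
    raw, meta = save_article(
        Path(tmp) / "articles",
        1,
        "Example article text.",
        {"id": 1, "title": "Example title", "url": "https://example.com/news/1"},
    )
    saved_text = raw.read_text(encoding="utf-8")
    saved_meta = json.loads(meta.read_text(encoding="utf-8"))
    raw_name = raw.name
```

Keeping the text and the metadata in separate files means the pipeline can later read the raw text without parsing JSON first.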
Pipeline:
1. Short summary: Your code can automatically process raw texts from the previous step and perform part-of-speech tagging and basic morphological analysis.
2. Deadline: TBD.
3. Format: each student works in their own PR.
4. Dataset volume: 100 articles.
5. Design document: Laboratory work №6. Process raw data.
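One possible "basic morphological analysis" step is counting how often each part of speech occurs in a text. The sketch below does not run a real tagger: the `tagged_tokens` list is hand-written sample data standing in for tagger output, and `pos_frequencies` is a hypothetical helper name chosen for this example.

```python
from collections import Counter

# Hand-written (token, POS tag) pairs; in the real pipeline these
# would come from an NLP library rather than being typed in.
tagged_tokens = [
    ("The", "DET"),
    ("scraper", "NOUN"),
    ("saves", "VERB"),
    ("the", "DET"),
    ("articles", "NOUN"),
    (".", "PUNCT"),
]


def pos_frequencies(tokens: list[tuple[str, str]]) -> Counter:
    """Count POS tags, skipping punctuation (hypothetical helper)."""
    return Counter(tag for _, tag in tokens if tag != "PUNCT")


frequencies = pos_frequencies(tagged_tokens)
```

A `Counter` is convenient here because a tag that never occurs simply has a count of zero instead of raising an error.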

Lectures history

Date	Lecture topic	Important links
06.04.2024	Lecture: Introduction to technical track. 3rd party libraries.	N/A
13.04.2024	Lecture: Headers. HTML structure.	N/A
13.04.2024	Seminar: Local setup. Choose website.	Listing.
20.04.2024	Lecture: Search in HTML page.	N/A
20.04.2024	Seminar: requests: install, API.	Listing.

You can find a more complete summary of the lectures in Short summary of lectures.

Technical solution

Module	Description	Component	Needed for mark
pathlib	working with file paths	scraper	4
requests	downloading web pages	scraper	4
BeautifulSoup4	finding information on web pages	scraper	4
lxml	optional HTML parsing	scraper	6
datetime	working with dates	scraper	6
json	working with JSON text format	scraper, pipeline	4
spacy_udpipe	module for morphological analysis	pipeline	6
networkx	working with graphs	pipeline	10
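As an illustration of the `datetime` row, the sketch below turns a date string taken from an article page into a `datetime` object and back into text. The input format `"%d.%m.%Y %H:%M"` and the function name `parse_publication_date` are assumptions for this example; real websites print dates in many different formats, so the pattern has to be adapted to the chosen site.

```python
from datetime import datetime


def parse_publication_date(raw: str) -> datetime:
    """Parse a date such as '13.04.2024 10:30' (assumed site format)."""
    # strip() removes whitespace that often surrounds text taken from HTML
    return datetime.strptime(raw.strip(), "%d.%m.%Y %H:%M")


published = parse_publication_date(" 13.04.2024 10:30 ")
# Convert back to a single uniform format for storing in metadata.
as_text = published.strftime("%Y-%m-%d %H:%M:%S")
```

Storing dates in one uniform format makes articles from different pages easy to sort and compare later.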

The software solution is built on top of three components:

scraper.py - a module for finding articles from the given media, extracting text and dumping it to the file system. Students need to implement it.
pipeline.py - a module for processing text: part-of-speech tagging and basic morphological analysis. Students need to implement it.
article.py - a module for article abstraction to encapsulate low-level manipulations with the article.
