General Information

This technical track is aimed at building basic skills for retrieving data from external WWW resources and processing it for future linguistic research. The idea is to automatically obtain a dataset with a certain structure and appropriate content, and then perform morphological analysis on it using various natural language processing (NLP) libraries. See Dataset requirements.

Instructors:

Project Timeline

  1. Scrapper:

    1. Short summary: Your code can automatically parse the media website of your choice and save the texts together with their metadata in a proper format.

    2. Deadline: April 29.

    3. Format: each student works in their own PR.

    4. Dataset volume: 5-7 articles.

    5. Design document: Laboratory work №5. Retrieve raw data from the World Wide Web.
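The course itself builds the scrapper on requests and BeautifulSoup4 (introduced in the lectures below). Purely to illustrate the extraction step, here is a minimal sketch that uses only the standard library's html.parser instead of bs4, with a hardcoded page instead of a downloaded one; the tag choice and sample HTML are assumptions for illustration.

```python
# Minimal sketch of the text-extraction step of a scraper.
# The standard-library html.parser stands in for bs4 here, and the
# HTML is hardcoded instead of being downloaded with requests, so the
# sketch stays self-contained. The markup below is invented.
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collects text found inside <p> tags."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph and data.strip():
            self.paragraphs.append(data.strip())


html_page = """
<html><body>
  <h1>Sample headline</h1>
  <p>First paragraph of the article.</p>
  <p>Second paragraph of the article.</p>
</body></html>
"""

extractor = ArticleTextExtractor()
extractor.feed(html_page)
article_text = "\n".join(extractor.paragraphs)
```

With bs4 the same idea collapses to a couple of `find_all("p")` calls; the event-driven parser above just makes the mechanics visible.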

  2. Pipeline:

    1. Short summary: Your code can automatically process the raw texts from the previous step, performing part-of-speech tagging and basic morphological analysis.

    2. Deadline: May 27.

    3. Format: each student works in their own PR.

    4. Dataset volume: 5-7 articles.

    5. Design document: Laboratory work №6. Process raw data.
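The pipeline output in lab no. 6 uses the CoNLL-U format (introduced in the lecture of 29.04), where every token occupies one tab-separated line with ten fields. As a self-contained illustration, here is a sketch of parsing one such token line with the standard library only; the sample sentence is invented.

```python
# CoNLL-U stores one token per line with ten tab-separated fields:
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
# The sample line below is invented for illustration.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_token_line(line):
    """Split one CoNLL-U token line into a field-name -> value dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 fields, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))


token = parse_token_line(
    "1\tcats\tcat\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_"
)
```

In the actual lab, libraries such as spacy_udpipe and stanza produce these lines for you; knowing the field layout helps when inspecting their output.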

Lecture history

| Date | Lecture topic | Important links |
|------------|--------------------------------------------------|----------------------------------|
| 01.04.2024 | Lecture: Introduction to the technical track. | Lab no. 5 description |
| 01.04.2024 | Seminar: Local setup. Choosing a website. | N/A |
| 08.04.2024 | Lecture: 3rd-party libraries. Browser headers. | N/A |
| 08.04.2024 | Seminar: requests: install, API. | Listing. |
| 15.04.2024 | Lecture: HTML structure. The bs4 library. | N/A |
| 15.04.2024 | Seminar: bs4: install, API. | Listing. |
| 22.04.2024 | Lecture: Filesystem with pathlib, dates. | N/A |
| 22.04.2024 | Seminar: Filesystem with pathlib, dates. | Listing 1. Listing 2. Listing 3. |
| 29.04.2024 | Introduction to lab 6. CoNLL-U format. | N/A |
| 29.04.2024 | Lab 5 handover. | N/A |
| 13.05.2024 | Seminar: text analysis with udpipe, stanza. | Listing. Listing. |
| 20.05.2024 | Seminar: graph analysis with networkx. | Listing. |
| 27.05.2024 | Lab 6 handover. | N/A |
| 03.06.2024 | Extra handover day (with penalties). | N/A |

You can find a more complete summary of the lectures in Short summary of lectures.

Technical solution

| Module | Description | Component | Needed for mark |
|----------------|------------------------------------|--------------------|-----------------|
| pathlib | working with file paths | scrapper | 4 |
| requests | downloading web pages | scrapper | 4 |
| BeautifulSoup4 | finding information on web pages | scrapper | 4 |
| lxml | optional HTML parsing | scrapper | 6 |
| datetime | working with dates | scrapper | 6 |
| json | working with the JSON text format | scrapper, pipeline | 4 |
| spacy_udpipe | morphological analysis | pipeline | 6 |
| stanza | morphological analysis | pipeline | 8 |
| networkx | working with graphs | pipeline | 10 |
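To show how the standard-library modules from the table cooperate inside the scrapper component, here is a small sketch: pathlib builds per-article paths and datetime normalizes a publication date. The directory layout, the file-naming scheme, and the DD.MM.YYYY input format are assumptions for illustration only.

```python
# Sketch: pathlib for per-article file paths, datetime for dates.
# The "tmp/articles" directory and the "<id>_raw.txt" / "<id>_meta.json"
# naming are hypothetical, not the course's prescribed layout.
from datetime import datetime
from pathlib import Path

ASSETS_PATH = Path("tmp") / "articles"  # hypothetical output directory


def article_paths(article_id):
    """Return (raw-text path, metadata path) for one article."""
    return (
        ASSETS_PATH / f"{article_id}_raw.txt",
        ASSETS_PATH / f"{article_id}_meta.json",
    )


def normalize_date(raw_date):
    """Convert a DD.MM.YYYY site date into ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(raw_date, "%d.%m.%Y").date().isoformat()
```

Normalizing dates once, at scraping time, spares the pipeline from re-parsing site-specific date formats later.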

The software solution is built on top of three components:

  1. scrapper.py - a module that finds articles on the chosen media website, extracts their texts, and dumps them to the file system. Students need to implement it.

  2. pipeline.py - a module that processes raw texts: part-of-speech tagging and basic morphological analysis. Students need to implement it.

  3. article.py - a module with an article abstraction that encapsulates low-level manipulations with the article.
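To make the idea of the article abstraction concrete, here is a hypothetical sketch of what such a class can encapsulate: keeping the text and its metadata together and hiding serialization details. The real article.py shipped with the course defines its own interface; the field names below are illustrative only.

```python
# Hypothetical sketch of an article abstraction: the scrapper fills in
# the fields, and serialization details stay inside the class.
# Field names are invented; the course's article.py has its own API.
import json
from dataclasses import dataclass


@dataclass
class Article:
    article_id: int
    url: str
    title: str = ""
    text: str = ""

    def get_meta(self):
        """Metadata that would be dumped next to the raw text."""
        return {"id": self.article_id, "url": self.url, "title": self.title}

    def to_meta_json(self):
        """Serialize metadata; ensure_ascii=False keeps Cyrillic readable."""
        return json.dumps(self.get_meta(), ensure_ascii=False)


article = Article(article_id=1, url="https://example.com/news/1", title="Sample")
meta = json.loads(article.to_meta_json())
```

Encapsulating dumping and loading here means scrapper.py and pipeline.py never touch file formats directly, which is exactly why the component is provided rather than implemented by students.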

Handing over your work

  1. Lab work is handed over during an oral presentation.

  2. The student explains how the program works and shows it in action.

  3. The student completes a mini-task from a mentor that requires some slight code modifications.

  4. The student receives a mark:

    1. The expected one, if all the steps above are completed and the mentor is satisfied with the answers.

    2. One point higher than the expected one, if all the steps above are completed and the mentor is very satisfied with the answers.

    3. One point lower than the expected one, if the lab is handed over up to one week after the deadline and the criteria from 4.1 are satisfied.

    4. Two points lower than the expected one, if the lab is handed over more than one week after the deadline and the criteria from 4.1 are satisfied.

Note

A student may improve their mark for the lab by completing tasks of the next level after handing it over.

A lab is accepted for oral presentation only if all the criteria below are satisfied:

  1. There is a Pull Request (PR) with a correctly formatted name: Scrapper, <NAME> <SURNAME> - <UNIVERSITY GROUP NAME>.

    1. Example: Scrapper, Irina Novikova - 20FPL2.

  2. The PR has a filled settings.json file with the expected mark. Acceptable values: 4, 6, 8, 10.

  3. The PR has a green CI status.

  4. The PR has the done label, set by a mentor.
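The exact schema of settings.json is defined by the course template; purely as an illustration, a file declaring the expected mark could look like the following (the key name is an assumption, not the template's actual field):

```json
{
  "target_score": 8
}
```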

Resources

  1. Academic performance

  2. Media websites list

  3. Python programming course from previous semester

  4. Scraping tutorials (Russian)

  5. Scraping tutorials (English)

  6. Starting guide

  7. Working with tests: locally and in CI

  8. Run Python Programs in Terminal

  9. Frequently asked questions