General Information

This technical track builds basic skills in retrieving data from external web resources and processing it for future linguistic research. The goal is to automatically obtain a dataset with a defined structure and appropriate content, and then to perform morphological analysis on it using various natural language processing (NLP) libraries. See Dataset requirements.

Instructors:

Project Timeline

  1. Scraper:

    1. Short summary: Your code can automatically parse a media website of your choice and save the texts and their metadata in a proper format.

    2. Deadline: May 11.

    3. Format: each student works in their own PR.

    4. Dataset volume: 100 articles.

    5. Design document: Laboratory work №5. Retrieve raw data from World Wide Web.

  2. Pipeline:

    1. Short summary: Your code can automatically process the raw texts from the previous step, performing part-of-speech tagging and basic morphological analysis.

    2. Deadline: TBD.

    3. Format: each student works in their own PR.

    4. Dataset volume: 100 articles.

    5. Design document: Laboratory work №6. Process raw data.
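Both stages list a dataset volume of 100 articles, so a quick completeness check between the scraper and the pipeline may be useful. The sketch below counts articles that have both a non-empty text file and a readable JSON metadata file. The file naming scheme (`<id>_raw.txt`, `<id>_meta.json`) and the function names are assumptions made for illustration; the design documents define the actual format.

```python
import json
from pathlib import Path

REQUIRED_ARTICLES = 100  # dataset volume stated in the project timeline


def count_complete_articles(dataset_dir: Path) -> int:
    """Count articles with non-empty raw text and valid JSON metadata.

    Assumes hypothetical file names: <id>_raw.txt and <id>_meta.json.
    """
    complete = 0
    for raw_path in sorted(dataset_dir.glob("*_raw.txt")):
        article_id = raw_path.name.removesuffix("_raw.txt")
        meta_path = dataset_dir / f"{article_id}_meta.json"
        if not raw_path.read_text(encoding="utf-8").strip():
            continue  # an empty text file does not count as an article
        if not meta_path.is_file():
            continue  # metadata is missing for this article
        try:
            json.loads(meta_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue  # metadata exists but is not valid JSON
        complete += 1
    return complete


def is_dataset_ready(dataset_dir: Path, required: int = REQUIRED_ARTICLES) -> bool:
    """Return True when the dataset holds at least `required` complete articles."""
    return count_complete_articles(dataset_dir) >= required
```

Calling `is_dataset_ready(Path("path/to/dataset"))` before starting the pipeline gives an early signal that the scraper output is incomplete.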

Lectures history

| Date | Lecture topic | Important links |
|---|---|---|
| 06.04.2024 | Lecture: Introduction to technical track. 3rd party libraries. | N/A |
| 13.04.2024 | Lecture: Headers. HTML structure. | N/A |
| 13.04.2024 | Seminar: Local setup. Choose website. | Listing. |
| 20.04.2024 | Lecture: Search in HTML page. | N/A |
| 20.04.2024 | Seminar: requests: install, API. | Listing. |

A more complete summary of the lectures is available in Short summary of lectures.

Technical solution

| Module | Description | Component | Need to get |
|---|---|---|---|
| pathlib | working with file paths | scraper | 4 |
| requests | downloading web pages | scraper | 4 |
| BeautifulSoup4 | finding information on web pages | scraper | 4 |
| lxml | optional HTML parsing | scraper | 6 |
| datetime | working with dates | scraper | 6 |
| json | working with the JSON text format | scraper, pipeline | 4 |
| spacy_udpipe | module for morphological analysis | pipeline | 6 |
| networkx | working with graphs | pipeline | 10 |
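As an illustration of how the scraper modules might fit together, the sketch below searches an HTML fragment with BeautifulSoup4 and converts the publication date with `datetime`. The HTML, the tag and class names, and the `parse_article` helper are hypothetical; every media website has its own markup. In a real scraper the HTML string would typically come from `requests.get(url).text` rather than from a constant.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical page fragment; real websites use their own tags and classes.
SAMPLE_HTML = """
<html><body>
  <h1 class="article-title">Sample headline</h1>
  <span class="article-date">20.04.2024</span>
  <div class="article-body"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""


def parse_article(html: str) -> dict:
    """Extract title, date and text from one article page (illustrative only)."""
    # "html.parser" ships with Python; "lxml" is an optional alternative parser.
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1", class_="article-title").get_text(strip=True)
    date_text = soup.find("span", class_="article-date").get_text(strip=True)
    paragraphs = soup.find("div", class_="article-body").find_all("p")
    return {
        "title": title,
        # The date format is an assumption that matches the sample fragment.
        "date": datetime.strptime(date_text, "%d.%m.%Y"),
        "text": "\n".join(p.get_text(strip=True) for p in paragraphs),
    }


article = parse_article(SAMPLE_HTML)
print(article["title"], article["date"].date())
```

On real pages `find` returns `None` when nothing matches, so production code would need to check each result before calling `get_text`.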

The software solution is built on three components:

  1. scraper.py - a module for finding articles on the given media website, extracting their text and saving it to the file system. Students need to implement it.

  2. pipeline.py - a module for processing text: part-of-speech tagging and basic morphological analysis. Students need to implement it.

  3. article.py - a module for article abstraction to encapsulate low-level manipulations with the article.
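To make the idea of an article abstraction more concrete, here is a minimal sketch of a class that hides file naming and serialization from the scraper and the pipeline. It is not the course's actual article.py: the class name `ArticleSketch`, its fields, and the file naming scheme are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ArticleSketch:
    """Hypothetical article abstraction; the real article.py may differ."""

    article_id: int
    url: str
    title: str = ""
    text: str = ""

    def raw_path(self, base_dir: Path) -> Path:
        # Assumed naming scheme: <id>_raw.txt
        return base_dir / f"{self.article_id}_raw.txt"

    def meta_path(self, base_dir: Path) -> Path:
        # Assumed naming scheme: <id>_meta.json
        return base_dir / f"{self.article_id}_meta.json"

    def save(self, base_dir: Path) -> None:
        """Write the text and the metadata to two separate files."""
        base_dir.mkdir(parents=True, exist_ok=True)
        self.raw_path(base_dir).write_text(self.text, encoding="utf-8")
        meta = {"id": self.article_id, "url": self.url, "title": self.title}
        self.meta_path(base_dir).write_text(
            json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8"
        )
```

With such a class the scraper only fills in the fields and calls `save`, while the pipeline can locate the same files through `raw_path` and `meta_path` without repeating path logic.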

Resources

  1. Academic performance

  2. Media websites list

  3. Python programming course from previous semester

  4. Scraping tutorials (Russian)

  5. Scraping tutorials (English)

  6. Useful documentation