Final project description

Each team is to work with the corresponding folder:

  • 22FPL1 team 1 - axmatova

  • 22FPL1 team 2 - balmont

  • 22FPL1 team 3 - blok

  • 22FPL1 team 4 - bunin

  • 22FPL1 team 5 - cvetaeva

  • 22FPL2 team 1 - mandelstamm

  • 22FPL2 team 2 - moritz

  • 22FPL2 team 3 - pasternak

  • 22FPL2 team 4 - silverage

All the files from the folder are to be joined into one .txt file, which is then processed using UDPipe.
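The joining step can be done by hand, but a short script is less error-prone. The sketch below is only one possible way to do it, not part of the course tooling. It assumes that the team folder contains UTF-8 encoded .txt files; the function name join_text_files and the output file name are illustrative.

```python
from pathlib import Path


def join_text_files(folder, output):
    """Concatenate every .txt file in `folder` into `output`; return the file count."""
    folder = Path(folder)
    output = Path(output).resolve()
    # Collect the sources first and skip the output file itself,
    # in case it is stored in the same folder.
    sources = sorted(p for p in folder.glob("*.txt") if p.resolve() != output)
    with output.open("w", encoding="utf-8") as out:
        for path in sources:
            text = path.read_text(encoding="utf-8")
            # A blank line between files keeps the texts from running together.
            out.write(text.rstrip("\n") + "\n\n")
    return len(sources)
```

For example, join_text_files("blok", "blok_joined.txt") would write all the texts from the blok folder into blok_joined.txt, which can then be passed to UDPipe.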

The resulting .conllu file is to be opened in a text editor and then copied to an .xlsx file. To work with the file you will need the following columns in the table:

  • token number

  • token

  • lemma

  • part of speech

  • morphological characteristics

Please delete all the remaining columns provided by UDPipe and add three new columns:

  • comments to POS tags

  • comments to characteristics

  • comments to tokens
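Instead of copying and deleting columns by hand, the table can be prepared with a script. The following is a minimal sketch, assuming that the UDPipe output follows the standard CoNLL-U layout of ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). It writes a .csv file, which can be opened in a spreadsheet editor and saved as .xlsx; writing .xlsx directly would require a third-party library. The function names are illustrative.

```python
import csv

HEADER = [
    "token number", "token", "lemma", "part of speech",
    "morphological characteristics",
    "comments to POS tags", "comments to characteristics", "comments to tokens",
]


def conllu_to_rows(conllu_text):
    """Keep ID, FORM, LEMMA, UPOS and FEATS; add three empty comment columns."""
    rows = [HEADER]
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines such as "# sent_id = 1".
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            continue  # not a well-formed token line
        token_id, form, lemma, upos, _xpos, feats = fields[:6]
        rows.append([token_id, form, lemma, upos, feats, "", "", ""])
    return rows


def write_csv(rows, path):
    """Write the rows as UTF-8 with a BOM so that Excel detects Cyrillic text."""
    with open(path, "w", encoding="utf-8-sig", newline="") as handle:
        csv.writer(handle).writerows(rows)
```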

Using the .conllu file, you are supposed to compile a word frequency dictionary. Once it is ready, work with the least frequent word forms to find the mistakes in tokenization made by UDPipe. If any of the word forms turn out to be words joined with punctuation marks or words split into several tokens, you are supposed to correct the mistakes in the .conllu file and write comments in the “comments to tokens” column in the .xlsx file.
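One possible way to build the frequency dictionary is sketched below. It relies on several assumptions that are not requirements of the task: word forms are lowercased before counting, tokens tagged PUNCT are skipped, and multiword token ranges (IDs such as 1-2) are ignored. The looks_suspicious helper is a rough heuristic for spotting words joined with punctuation, not a complete check; all names are illustrative.

```python
from collections import Counter


def word_form_frequencies(conllu_text, skip_upos=("PUNCT",)):
    """Count lowercased word forms (FORM column) in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Keep only ordinary token lines: ten fields and a plain integer ID.
        if len(fields) != 10 or not fields[0].isdigit():
            continue
        if fields[3] in skip_upos:
            continue
        counts[fields[1].lower()] += 1
    return counts


def rarest(counts, n=20):
    """Return the n least frequent forms, ties ordered alphabetically."""
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))[:n]


def looks_suspicious(form):
    """Heuristic: a form that mixes letters with non-hyphen punctuation."""
    has_letter = any(ch.isalpha() for ch in form)
    has_other = any(not ch.isalnum() and ch != "-" for ch in form)
    return has_letter and has_other
```

The forms returned by rarest are a reasonable place to start the manual check; every candidate still has to be verified against the text before the .conllu file is changed.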

Next comes checking the table for mistakes in morphological annotation made by UDPipe. Whenever you come across a mistake in either the POS tags or the characteristics, fill in the corresponding column in the .xlsx file with a comment and make the correction in the .conllu file. When judging whether the morphological annotation is mistaken, you are welcome to use the following sources:

  1. UD POS tags.

  2. UD POS features.

  3. UD Russian POS tags and features.

  4. Dictionaries of the Russian language containing morphological information.

Please do not forget to cite the sources you use in the comments. You are supposed to finish working with the .conllu file and the .xlsx file and send them to Klimova Margarita Andreevna, together with the frequency dictionary in the .xlsx format, by 16 June.

Klimova Margarita Andreevna will check your comments and corrections and suggest improvements, if any.

Note

The correctness of the .conllu file will be checked with a script taken from the repository with code from the Technical Track. You can also use it to check the .conllu file yourself: run the script locally and, if it does not fail, send the file to Klimova Margarita Andreevna.

This script can be run from PyCharm or PowerShell from the root of the project, like this: python admin_utils/final_project/checker.py PATH_TO_FILE. Your forks should already contain this script, so pull the latest changes and use it; otherwise, write to the chat and ask the assistants.

For example, if you have a file named final.conllu, place it in the data folder:

|-- 2023-2-level-ctrl
    |-- data
        |-- final.conllu

Then you can run the checker script as follows (do not forget to activate the environment and update PYTHONPATH):

python admin_utils/final_project/checker.py data/final.conllu

Meanwhile your task will be to prepare the exam presentation, which should include a report of the mistakes in tokenization and morphological annotation you came across - both a quantitative and qualitative (typology of mistakes, possible reasons for them) analysis. Time limit - 7 minutes.
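For the quantitative part of the report, the number of commented rows in each comment column gives a first overview of the mistakes found. The sketch below assumes that the table has been exported from .xlsx to a UTF-8 .csv file and that the comment columns carry exactly the names listed above; the function names are illustrative.

```python
import csv
from collections import Counter

COMMENT_COLUMNS = (
    "comments to POS tags",
    "comments to characteristics",
    "comments to tokens",
)


def load_rows(path):
    """Read the exported table as a list of dictionaries keyed by column name."""
    with open(path, encoding="utf-8-sig", newline="") as handle:
        return list(csv.DictReader(handle))


def count_commented_rows(rows):
    """Count the rows that have a non-empty comment, per comment column."""
    totals = Counter({column: 0 for column in COMMENT_COLUMNS})
    for row in rows:
        for column in COMMENT_COLUMNS:
            if (row.get(column) or "").strip():
                totals[column] += 1
    return totals
```

Dividing each total by the number of rows gives the share of tokens with a mistake of that kind. The typology of mistakes and their possible reasons still have to be worked out by reading the comments themselves.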

The presentation is to be delivered at the exam. Assessment criteria:

  1. The proportion of the identified mistakes;

  2. The quality of their analysis in the comments section of the table and in the presentation;

  3. The precision of corrections made in the .conllu file;

  4. Following the time limit;

  5. The quality of the oral presentation (text learnt by heart, fluency and intelligibility of speech);

  6. The quality of the computer presentation;

  7. The quality of answers to follow-up questions.

Attention

The mark you get as a result will have a coefficient of 0.8. The remaining 20% of the exam mark comes from the mark for working with КрякваБот.