Final project description
Each team is to work with the corresponding folder:
22FPL1 team 1 - axmatova
22FPL1 team 2 - balmont
22FPL1 team 3 - blok
22FPL1 team 4 - bunin
22FPL1 team 5 - cvetaeva
22FPL2 team 1 - mandelstamm
22FPL2 team 2 - moritz
22FPL2 team 3 - pasternak
22FPL2 team 4 - silverage
All the files from the folder are to be joined into one .txt file and then processed using UDPipe.
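The joining step can be sketched in Python as follows. This is only a minimal illustration: it assumes that the texts in the team folder are UTF-8 encoded .txt files, and the function name join_txt_files is hypothetical, not part of the course repository.

```python
from pathlib import Path


def join_txt_files(folder: str, output_file: str, encoding: str = "utf-8") -> int:
    """Concatenate every .txt file in `folder` into `output_file`.

    Returns the number of source files that were joined.
    Assumption: all source files are plain text in the given encoding.
    """
    out_path = Path(output_file).resolve()
    # Sort for a reproducible order and skip the output file itself
    # in case it is stored in the same folder as the sources.
    sources = sorted(
        path for path in Path(folder).glob("*.txt") if path.resolve() != out_path
    )
    with open(out_path, "w", encoding=encoding) as out:
        for path in sources:
            text = path.read_text(encoding=encoding)
            # Guarantee exactly one newline between consecutive files.
            out.write(text.rstrip("\n") + "\n")
    return len(sources)
```

For example, join_txt_files("blok", "blok/all_texts.txt") would collect the texts of one team folder into a single file (both paths here are placeholders).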
The resulting .conllu file is to be opened in a text editor and then copied to an .xlsx file. To work with the file you will need the following columns in the table:
token number
token
lemma
part of speech
morphological characteristics
Please delete all the remaining columns provided by UDPipe and add three new columns:
comments to POS tags
comments to characteristics
comments to tokens
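If you prefer to prepare the table programmatically rather than by copying, the column selection can be sketched as below. The sketch relies on the standard CoNLL-U field order (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC); the function names and the choice of a tab-separated intermediate file, which a spreadsheet program can open and save as .xlsx, are assumptions made for illustration.

```python
import csv
import io

# Column title -> index of the CoNLL-U field that is kept.
KEPT = {
    "token number": 0,
    "token": 1,
    "lemma": 2,
    "part of speech": 3,
    "morphological characteristics": 5,
}
# Empty columns to be filled in by hand.
EXTRA = [
    "comments to POS tags",
    "comments to characteristics",
    "comments to tokens",
]


def conllu_to_rows(conllu_text: str) -> list:
    """Return a header row followed by one row per token line."""
    rows = [list(KEPT) + EXTRA]
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines such as "# text = ...".
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            continue  # malformed line; inspect it manually
        rows.append([fields[i] for i in KEPT.values()] + [""] * len(EXTRA))
    return rows


def rows_to_tsv(rows: list) -> str:
    """Serialise the rows as tab-separated text."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Lines that do not have exactly ten fields are silently skipped here, so it is worth comparing the number of rows with the number of tokens in the .conllu file.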
Using the .conllu file, you are supposed to build a word frequency dictionary. Once it is ready, you are welcome to work with the least frequent word forms to find the mistakes in tokenization made by UDPipe. If any of the word forms turn out to be words joined with punctuation marks or words split into several tokens, you are supposed to correct the mistakes in the .conllu file and write comments in the “comments to tokens” column in the .xlsx file.
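A frequency dictionary of word forms and a first pass over the rarest forms can be sketched as follows. Several choices here are assumptions rather than requirements of the task: word forms are lowercased, tokens tagged PUNCT are left out, and the helper looks_mis_tokenized is a rough heuristic that only flags forms mixing letters with non-alphanumeric characters other than a hyphen. It cannot detect words that were split into several tokens, so those still need manual inspection.

```python
from collections import Counter


def word_form_frequencies(conllu_text: str) -> Counter:
    """Count lowercased word forms (FORM column), skipping punctuation tokens."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 10 or fields[3] == "PUNCT":
            continue
        counts[fields[1].lower()] += 1
    return counts


def least_frequent(counts: Counter, n: int = 20) -> list:
    """Return the n rarest word forms, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))[:n]


def looks_mis_tokenized(form: str) -> bool:
    """Heuristic: a letter together with a non-alphanumeric, non-hyphen character."""
    has_letter = any(ch.isalpha() for ch in form)
    has_other = any(not ch.isalnum() and ch != "-" for ch in form)
    return has_letter and has_other
```

Filtering the output of least_frequent with looks_mis_tokenized gives a short list of candidates to check by hand before editing the .conllu file.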
Next comes checking the table for mistakes in the morphological annotation made by UDPipe. Whenever you come across a mistake in either the POS tags or the characteristics, fill in the corresponding column with a comment in the .xlsx file and make the correction in the .conllu file. When judging whether the morphological annotation is mistaken, you are welcome to use the following sources:
Dictionaries of the Russian language containing morphological information.
Please do not forget to refer to the sources you use in the comments.
You are supposed to stop working with the .conllu file and the .xlsx file and send them to Klimova Margarita Andreevna together with the frequency dictionary in the .xlsx format by 16 June.
Klimova Margarita Andreevna will check your comments and corrections and suggest improvements, if any.
Note
Correctness of the .conllu file will be checked with a script taken from the repository with code from the Technical Track. You can also use it to check the .conllu file yourself: run it locally and then, if it does not fail, send the file to Klimova Margarita Andreevna.
This script can be run from PyCharm or PowerShell from the root of the project, like this:
python admin_utils/final_project/checker.py PATH_TO_FILE
Your forks should already contain this script, so pull the latest changes and use it; otherwise, write to the chat and ask the assistants.
For example, if you have a file final.conllu, place it in the data folder:
|-- 2023-2-level-ctrl
    |-- data
        |-- final.conllu
Then you can run the checker script with (do not forget to activate the environment and update PYTHONPATH):
python admin_utils/final_project/checker.py data/final.conllu
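Manual edits in a text editor can easily damage the tab-separated layout of the .conllu file, so a quick structural pre-check before running the checker may save time. The sketch below is not the course checker and does not reproduce its checks, which are not described here. It only verifies two general CoNLL-U conventions: every token line has ten tab-separated fields, and no field is empty (unspecified values are written as an underscore). The function name find_malformed_lines is hypothetical.

```python
def find_malformed_lines(conllu_text: str) -> list:
    """Return (line number, message) pairs for structurally broken token lines."""
    problems = []
    for number, line in enumerate(conllu_text.splitlines(), start=1):
        # Blank lines separate sentences; lines starting with '#' are comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            problems.append(
                (number, f"expected 10 tab-separated fields, found {len(fields)}")
            )
        elif any(field == "" for field in fields):
            problems.append((number, "empty field (use '_' for unspecified values)"))
    return problems
```

An empty result only means that the layout looks plausible; the file should still be passed through admin_utils/final_project/checker.py as described above.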
Meanwhile, your task will be to prepare the exam presentation, which should include a report on the mistakes in tokenization and morphological annotation you came across: both a quantitative and a qualitative analysis (typology of mistakes, possible reasons for them). Time limit: 7 minutes.
The presentation is to be delivered at the exam. Assessment criteria:
The proportion of the identified mistakes;
The quality of their analysis in the comments section of the table and in the presentation;
The precision of corrections made in the .conllu file;
Following the time limit;
The quality of the oral presentation (text learnt by heart, fluency and intelligibility of speech);
The quality of the computer presentation;
The quality of answers to follow-up questions.
Attention
The mark you get as a result will be weighted with a coefficient of 0.8. The remaining 20% of the exam mark comes from the mark for working with КрякваБот.