Working with UD format and CoNLL-U
UD (Universal Dependencies) is a framework for consistent
annotation of grammar (parts of speech, morphological features, and
syntactic dependencies) across different human languages. All this
annotation is usually stored in CoNLL-U, a vertical,
table-like format.
UD
The UD format for storing morphological information is structured as
follows: FeatureName=Value|FeatureName=Value|FeatureName=Value...
where FeatureName is the name of a morphological feature of the
token (for example, Number) and Value is the value of that
feature (for example, Sing, short for singular).
For example:
Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing

- Animacy=Inan - inanimate
- Case=Acc - accusative case
- Degree=Pos - degree of comparison: positive/first degree
- Gender=Masc - masculine gender
- Number=Sing - singular number
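The FeatureName=Value|FeatureName=Value layout maps naturally onto a dictionary. The sketch below is only an illustration: the helper names `parse_feats` and `format_feats` are hypothetical, and sorting feature names alphabetically when writing them back is an assumption that happens to match the example above.

```python
def parse_feats(feats: str) -> dict[str, str]:
    """Split a UD feature string into a {FeatureName: Value} mapping."""
    if feats == "_":  # placeholder used when a token has no features
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def format_feats(feats: dict[str, str]) -> str:
    """Join a mapping back into the FeatureName=Value|... form."""
    if not feats:
        return "_"
    # Assumption: feature names are written in case-insensitive alphabetical order.
    ordered = sorted(feats.items(), key=lambda item: item[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)


example = "Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing"
parsed = parse_feats(example)
print(parsed["Case"])        # Acc
print(format_feats(parsed))  # same string as `example`
```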
Note
The list of all tags used in the UD system is available on the dedicated page.
CoNLL-U structure
After processing the articles, you should save
the annotated data in a .conllu file.
Each token occupies one line of the file, and each line consists of the following ten fields:
| Field | Description | Library |
|---|---|---|
| ID | Word index, integer starting at 1 for each new sentence | |
| FORM | Word form or punctuation symbol | |
| LEMMA | Lemma or stem of word form | |
| UPOS | Universal POS tag | |
| XPOS | Language-specific POS tag | |
| FEATS | List of morphological features, structured as described in the UD section above | |
| HEAD | Head of the current word | |
| DEPREL | Universal dependency relation to the HEAD | |
| DEPS | Enhanced dependency graph in the form of a list of head-deprel pairs | |
| MISC | Any other annotation | |
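As a rough illustration of the table, one token line can be read into a record with one attribute per field. This is a sketch, not a reference parser: the `Token` class and `parse_token_line` function are hypothetical names, every field is kept as a raw string, and the fields are assumed to be separated by single tab characters, as in standard CoNLL-U files. The sample line is the one explained later in this section.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One CoNLL-U token line; every field is kept as the raw string."""
    id: str
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str
    head: str
    deprel: str
    deps: str
    misc: str


def parse_token_line(line: str) -> Token:
    """Split a tab-separated token line into its ten fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(Token._fields):
        raise ValueError(f"expected {len(Token._fields)} fields, got {len(fields)}")
    return Token(*fields)


line = "1\tКрасивая\tкрасивый\tADJ\t_\tCase=Nom|Degree=Pos|Gender=Fem|Number=Sing\t3\tamod\t_\t_"
token = parse_token_line(line)
print(token.form, token.upos, token.head)
```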
In addition, you must take into account that:
- New sentences start with the token ID being 1;
- Fields cannot be empty. If a field has no value, _ is used instead;
- Comment lines start with # and usually contain the text of the sentence.
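These rules can be sketched as a small writer function. Everything here is illustrative: `format_sentence` and the token dictionaries are hypothetical, the sample tokens are made up, and two details come from the general CoNLL-U convention rather than from the text above, namely the `# text = ...` form of the comment line and the blank line that closes each sentence.

```python
# Field names after ID, in the order they appear on a token line.
FIELDS = ("form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc")


def format_sentence(text: str, tokens: list[dict[str, str]]) -> str:
    """Render one sentence as a CoNLL-U block."""
    lines = [f"# text = {text}"]  # comment line holding the sentence text
    # Token IDs restart from 1 for every sentence.
    for index, token in enumerate(tokens, start=1):
        values = [str(index)]
        for name in FIELDS:
            # A missing or empty value is written as an underscore.
            values.append(token.get(name) or "_")
        lines.append("\t".join(values))
    # Assumption: a blank line separates this sentence from the next one.
    return "\n".join(lines) + "\n\n"


block = format_sentence(
    "Hi there",
    [{"form": "Hi", "upos": "INTJ"}, {"form": "there", "upos": "ADV"}],
)
print(block, end="")
```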
Let's explain the first line from Desired output for mark 6:

1 Красивая красивый ADJ _ Case=Nom|Degree=Pos|Gender=Fem|Number=Sing 3 amod _ _

- 1 - ID
- Красивая - text of the token
- красивый - lemma of the token
- ADJ - POS
- _ - language-specific POS (none in this case)
- Case=Nom|Degree=Pos|Gender=Fem|Number=Sing - morphological features of the token as per tags:
  - Case=Nom - nominative case
  - Degree=Pos - degree of comparison: positive/first degree
  - Gender=Fem - feminine gender
  - Number=Sing - singular number
- 3 - the ID of the HEAD for the current token. HEAD is мама in this case
- amod - relation to the HEAD token (amod - adjectival modifier as per tags)
- _ - pair of HEAD:RELATION for the current token (none in this case)
- _ - any other annotation (none in this case)

For mark 8 you will have start_char=0|end_char=8 in the last field (MISC). It denotes the start and end of the token in the text, measured in characters.
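One possible way to produce the start_char/end_char value is to locate each token in the sentence text. The sketch below rests on assumptions: `char_span_misc` is a hypothetical helper, the sentence text is invented apart from its first token, and end_char is treated as exclusive, which is inferred from the example (an 8-character token with start 0 and end 8).

```python
def char_span_misc(text: str, form: str, search_from: int = 0) -> tuple[str, int]:
    """Return the MISC value for `form` and the position where the next search should start."""
    start = text.find(form, search_from)
    if start == -1:
        raise ValueError(f"{form!r} not found in text from position {search_from}")
    end = start + len(form)  # exclusive end offset, counted in characters
    return f"start_char={start}|end_char={end}", end


# Hypothetical sentence text; only the first token comes from the example above.
text = "Красивая мама"
misc, next_pos = char_span_misc(text, "Красивая")
print(misc)  # start_char=0|end_char=8
```

Passing the returned position as `search_from` for the next token keeps repeated word forms from matching an earlier occurrence.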
Attention
More information about the structure of the CoNLL-U
format is available on the dedicated
page.