Working with UD format and `CoNLL-U`

UD (Universal Dependencies) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. These annotations are usually stored in CoNLL-U, a vertical, table-like format.

UD 

The UD format for storing morphological information is structured as follows: `FeatureName=Value|FeatureName=Value|FeatureName=Value...`, where `FeatureName` is the name of a morphological feature of the token (for example, `Number`) and `Value` is the value of that feature (for example, `Sing`, short for singular).

For example:

Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing
- Animacy=Inan - inanimate
- Case=Acc - accusative case
- Degree=Pos - degree of comparison: positive/first degree
- Gender=Masc - masculine gender
- Number=Sing - singular number
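
The feature string above can be split into a dictionary with plain string operations. This is a minimal sketch, not part of any library: the helper names `parse_feats` and `format_feats` are hypothetical, and sorting features by name when writing them back is an assumption about the usual ordering convention.

```python
def parse_feats(feats: str) -> dict[str, str]:
    """Turn a FEATS string into a {feature: value} dictionary."""
    if feats == "_":  # "_" marks a field with no value
        return {}
    # Each pair looks like "Case=Acc"; pairs are joined with "|"
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def format_feats(features: dict[str, str]) -> str:
    """Turn a {feature: value} dictionary back into a FEATS string."""
    if not features:
        return "_"
    # Assumption: features are written sorted by name, ignoring case
    ordered = sorted(features.items(), key=lambda item: item[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)


features = parse_feats("Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing")
print(features["Case"])
print(format_feats(features))
```

Keeping the features in a dictionary makes it easy to look up a single value, such as the case of a token, without repeated string searching.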

Note

The list of all tags used in the UD system is available on the dedicated page.

`CoNLL-U` structure 

After processing articles, you should save the annotated data in a .conllu file. Each token line in the file consists of the following fields:

Field	Description	Library
ID	Word index, integer starting at 1 for each new sentence	`spacy_udpipe`, `stanza`
FORM	Word form or punctuation symbol	`spacy_udpipe`, `stanza`
LEMMA	Lemma or stem of word form	`spacy_udpipe`, `stanza`
UPOS	Universal POS tag	`spacy_udpipe`, `stanza`
XPOS	Language-specific POS tag	`spacy_udpipe`, `stanza`
FEATS	List of morphological features structured as `FeatureName=Value|FeatureName=Value|FeatureName=Value...` as per UD format	`spacy_udpipe`, `stanza`
HEAD	Head of the current word: the ID of another word in the sentence, or 0 for the root	`spacy_udpipe`, `stanza`
DEPREL	Universal dependency relation to the HEAD	`spacy_udpipe`, `stanza`
DEPS	Enhanced dependency graph in the form of a list of head-deprel pairs	`spacy_udpipe`, `stanza`
MISC	Any other annotation	`stanza`

In addition, you must take into account that:

The token ID restarts at 1 at the beginning of each new sentence;
Fields cannot be empty. If a field has no value, `_` is used instead;
Comment lines start with `#` and usually contain the text of the sentence.
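
The rules above can be sketched as a small serializer. This is an illustration under stated assumptions rather than the required implementation: the function names and dictionary keys are hypothetical, the two-token sentence is made up, fields are joined with tabs and sentences are closed with a blank line as in the CoNLL-U specification, and the `# text = ...` form of the comment is a common convention.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def token_to_line(token: dict) -> str:
    """Join the ten fields with tabs, writing "_" for missing values."""
    values = []
    for name in FIELDS:
        value = token.get(name)
        # A field can never be empty, so fall back to "_"
        values.append("_" if value in (None, "") else str(value))
    return "\t".join(values)


def sentence_to_conllu(text: str, tokens: list[dict]) -> str:
    """Build one sentence block: a comment line, then one line per token."""
    lines = [f"# text = {text}"]
    # IDs restart at 1 for every sentence
    for index, token in enumerate(tokens, start=1):
        lines.append(token_to_line({**token, "id": index}))
    return "\n".join(lines) + "\n\n"  # a blank line closes the sentence


# Made-up example data, only to show the layout
block = sentence_to_conllu(
    "Hello world",
    [
        {"form": "Hello", "lemma": "hello", "upos": "INTJ"},
        {"form": "world", "lemma": "world", "upos": "NOUN"},
    ],
)
print(block)
```

Writing the result to disk is then a matter of concatenating the blocks of all sentences and saving them with UTF-8 encoding.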

Let’s break down the first line of the desired output for mark 6, `1 Красивая красивый ADJ _ Case=Nom|Degree=Pos|Gender=Fem|Number=Sing 3 amod _ _`:

1 - ID
Красивая - text of the token
красивый - lemma of the token
ADJ - universal POS tag (adjective)
_ - language-specific POS tag (none in this case)
Case=Nom|Degree=Pos|Gender=Fem|Number=Sing - morphological features of the token as per tags:
- Case=Nom - nominative case
- Degree=Pos - degree of comparison: positive/first degree
- Gender=Fem - feminine gender
- Number=Sing - singular number
3 - the ID of the HEAD for the current token. HEAD is мама in this case
amod - relation to the HEAD token (amod - adjectival modifier as per tags)
_ - enhanced dependencies as HEAD:RELATION pairs for the current token (none in this case)
_ - any other annotation (none in this case)
- For mark 8 you will have start_char=0|end_char=8 in this field. These values are the character offsets at which the token starts and ends in the text.
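
The breakdown above can be reproduced in code by splitting a token line into its ten fields. This is a minimal sketch: `ConlluToken` and `parse_token_line` are hypothetical names, the line is assumed to be tab-separated as in a real .conllu file, and the MISC value is the mark 8 variant mentioned above. All fields are kept as strings for simplicity.

```python
from typing import NamedTuple


class ConlluToken(NamedTuple):
    """The ten fields of one token line, in file order."""
    id: str
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str
    head: str
    deprel: str
    deps: str
    misc: str


def parse_token_line(line: str) -> ConlluToken:
    """Split one tab-separated token line into named fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 10:
        raise ValueError(f"expected 10 fields, got {len(columns)}")
    return ConlluToken(*columns)


line = ("1\tКрасивая\tкрасивый\tADJ\t_\t"
        "Case=Nom|Degree=Pos|Gender=Fem|Number=Sing\t3\tamod\t_\t"
        "start_char=0|end_char=8")
token = parse_token_line(line)
print(token.lemma, token.upos, token.head, token.deprel)

# MISC uses the same key=value layout joined with "|"
offsets = dict(pair.split("=", 1) for pair in token.misc.split("|"))
print(int(offsets["end_char"]) - int(offsets["start_char"]))
```

Named fields keep later code readable: `token.head` is easier to follow than `columns[6]`.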

Attention

More information about the structure of the CoNLL-U format is available on the dedicated page.
