Dataset requirements
To analyze the collected articles effectively, the data must be organized in a consistent way. The structure of the dataset and each of its elements are described below.
Structure
+-- 2023-2-level-ctlr
    +-- tmp
        +-- articles
            +-- articles
                +-- 1_raw.txt <- the raw text of the article with the ID as the name
                +-- 1_meta.json <- the meta-information of the article
                +-- 1_cleaned.txt <- lowercased text with no punctuation
                +-- 1_udpipe_conllu.conllu <- processed text in the UD format (by UDPipe model)
                +-- 1_stanza_conllu.conllu <- processed text in the UD format (by Stanza model)
                +-- 1_image.png <- POS frequencies bar chart
                +-- 2_raw.txt
                +-- 2_meta.json
                +-- 2_cleaned.txt
                +-- 2_udpipe_conllu.conllu
                +-- 2_stanza_conllu.conllu
                +-- 2_image.png
                +-- ...
                +-- 100_raw.txt
                +-- 100_meta.json
                +-- 100_cleaned.txt
                +-- 100_udpipe_conllu.conllu
                +-- 100_stanza_conllu.conllu
                +-- 100_image.png
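The naming scheme above can be checked mechanically. The following is a minimal sketch, not part of the lab code: the helper names `expected_file_names` and `missing_files` are hypothetical, and the directory is passed in as a parameter rather than hard-coded.

```python
from pathlib import Path

# Suffixes taken from the structure listing above.
ARTIFACT_SUFFIXES = (
    "raw.txt",
    "meta.json",
    "cleaned.txt",
    "udpipe_conllu.conllu",
    "stanza_conllu.conllu",
    "image.png",
)


def expected_file_names(article_id: int) -> list[str]:
    """Return the file names expected for one article ID."""
    return [f"{article_id}_{suffix}" for suffix in ARTIFACT_SUFFIXES]


def missing_files(articles_dir: Path, article_id: int) -> list[str]:
    """Return the expected file names that are absent from articles_dir."""
    return [
        name
        for name in expected_file_names(article_id)
        if not (articles_dir / name).is_file()
    ]
```

Some of these files may only appear at later stages of the work, so a non-empty result from `missing_files` is probably best read as a report rather than as an error.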
Raw texts
Raw article texts are stored in N_raw.txt files, where N corresponds to the index of the article. The text is not preprocessed in any way.
Example:
Красивая - мама красиво, училась в ПДД и ЖКУ по адресу Львовская 10 лет с почтой test .
Processed texts
Ideally, the dataset contains three processed versions of each text:
cleaned text
morphological and syntactic annotation from the UDPipe model
morphological and syntactic annotation from the Stanza model
Cleaned text
Cleaned texts are stored in N_cleaned.txt files, where N corresponds to the index of the article.
Cleaned text is lowercased and does not include any punctuation. Word forms are the same as in the raw text.
Example:
красивая мама красиво училась в пдд и жку по адресу львовская 10 лет с почтой test
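One possible way to produce such a text is sketched below. It assumes that "punctuation" means every character whose Unicode general category starts with "P" and that leftover runs of whitespace are collapsed to a single space; both choices, as well as the function name `clean_text`, are assumptions made for illustration.

```python
import unicodedata


def clean_text(raw_text: str) -> str:
    """Lowercase the text and drop punctuation, keeping word forms intact."""
    without_punctuation = "".join(
        char
        for char in raw_text
        # Unicode general categories starting with "P" are punctuation.
        if not unicodedata.category(char).startswith("P")
    )
    # Assumption: whitespace left behind by removed marks is collapsed.
    return " ".join(without_punctuation.lower().split())


raw = (
    "Красивая - мама красиво, училась в ПДД и ЖКУ "
    "по адресу Львовская 10 лет с почтой test ."
)
print(clean_text(raw))
```

Applied to the raw example, this sketch prints the cleaned example shown above. Symbols outside the punctuation categories, such as currency or math signs, are left in place and would need separate handling if they must be removed too.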
Morphological and syntactic annotation from UDPipe and Stanza models
Texts with morphological and syntactic annotation from the UDPipe and Stanza models are stored in N_udpipe_conllu.conllu and N_stanza_conllu.conllu files respectively, where N corresponds to the index of the article.
The files contain the following fields for each token: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, and MISC.
Attention
Read more about the structure of such files in Working with UD format and CoNLL-U, and look at the example files for the UDPipe model and the Stanza model.
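To illustrate how the ten fields map onto a file, here is a small standard-library sketch that reads token lines into dictionaries. It relies on the CoNLL-U conventions that fields are tab-separated, comment lines start with `#`, and blank lines separate sentences. The function names are hypothetical, and the sample fragment is hand-written for illustration; it is not real UDPipe or Stanza output.

```python
CONLLU_FIELDS = (
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
)


def parse_token_line(line: str) -> dict[str, str]:
    """Split one tab-separated CoNLL-U token line into named fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(
            f"expected {len(CONLLU_FIELDS)} fields, got {len(values)}"
        )
    return dict(zip(CONLLU_FIELDS, values))


def read_tokens(conllu_text: str) -> list[dict[str, str]]:
    """Collect token rows, skipping comments and blank sentence breaks."""
    return [
        parse_token_line(line)
        for line in conllu_text.splitlines()
        if line.strip() and not line.startswith("#")
    ]


# Hand-written illustrative fragment, not real model output.
sample = (
    "# text = мама училась\n"
    "1\tмама\tмама\tNOUN\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tучилась\tучиться\tVERB\t_\tNumber=Sing|Tense=Past\t0\troot\t_\t_\n"
    "\n"
)
tokens = read_tokens(sample)
print(tokens[0]["UPOS"], tokens[1]["LEMMA"])
```

For real work a dedicated CoNLL-U parser is likely the better choice; the sketch only shows where each field sits in a line.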
Meta information
Meta information is stored in N_meta.json files, where N corresponds to the index of the article.
Meta-information includes:
Article id (it must match the id of the file)
Article URL
Article title
Article date
Article author
Article topics
Article POS frequencies (Lab 6 for mark 8)
Article pattern matches (Lab 6 for mark 10)
Example:
{
    "id": 2,
    "url": "https://www.nn.ru/text/style/2023/03/11/72125285/",
    "title": "«Вы актер или батюшка?» Простой рабочий одевается как Пушкин и ходит так на оборонный завод",
    "date": "2023-03-11 17:30:00",
    "author": [
        "Дарья Манохина"
    ],
    "topics": [
        "Стиль и красота"
    ],
    "pos_frequencies": {},
    "pattern_matches": {}
}
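A record like the one above can be sanity-checked before it is saved. The sketch below makes several assumptions: the date format is inferred from the single example, the `pos_frequencies` and `pattern_matches` keys are treated as optional because they belong to a later lab, and the names `validate_meta` and `dump_meta` are hypothetical.

```python
import json
from datetime import datetime

REQUIRED_KEYS = ("id", "url", "title", "date", "author", "topics")
# Assumption: inferred from the date string in the example above.
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def validate_meta(meta: dict, file_id: int) -> None:
    """Raise ValueError if the record breaks the rules listed above."""
    missing = [key for key in REQUIRED_KEYS if key not in meta]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if meta["id"] != file_id:
        raise ValueError("article id does not match the id of the file")
    # strptime itself raises ValueError on a malformed date string.
    datetime.strptime(meta["date"], DATE_FORMAT)


def dump_meta(meta: dict) -> str:
    """Serialize the record, keeping Cyrillic readable in the output."""
    return json.dumps(meta, ensure_ascii=False, indent=4)
```

Passing `ensure_ascii=False` keeps titles and author names human-readable in the file; when writing the result to disk, the file should then be opened with `encoding="utf-8"`.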
Volume
Aim to collect at least 100 articles from your chosen web source.
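The volume target can be checked by counting the raw text files. This is a minimal sketch under the assumption that every collected article has an N_raw.txt file in a single directory; the names `collected_ids` and `has_enough_articles` are hypothetical.

```python
import re
from pathlib import Path

RAW_NAME = re.compile(r"^(\d+)_raw\.txt$")
MIN_ARTICLES = 100  # the target stated above


def collected_ids(articles_dir: Path) -> list[int]:
    """Return the sorted article IDs that have an N_raw.txt file."""
    ids = []
    for path in articles_dir.iterdir():
        match = RAW_NAME.match(path.name)
        if match and path.is_file():
            ids.append(int(match.group(1)))
    return sorted(ids)


def has_enough_articles(articles_dir: Path) -> bool:
    """Check whether the directory meets the minimum article count."""
    return len(collected_ids(articles_dir)) >= MIN_ARTICLES
```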