Classification
Models
| Model | Lang |
|---|---|
| | EN |
| | RU |
| | RU |
| | EN |
| | EN |
| | EN |
| | RU |
| | RU |
| | RU |
| | RU |
Datasets
-
Lang: EN
Rows: 31915
Preprocess: Drop column `id`. Rename column `label` to `target`. Rename column `comment_text` to `source`. Reset indexes.
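The steps above map onto a few pandas calls. This is a minimal sketch, assuming the dataset has already been loaded into a `DataFrame`; the helper name `preprocess` and the sample rows are made up for illustration and are not taken from the original pipeline.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop `id`, rename `label` and `comment_text`, then reset the index."""
    return (
        df.drop(columns=["id"])
        .rename(columns={"label": "target", "comment_text": "source"})
        .reset_index(drop=True)
    )


# Hypothetical rows standing in for the real dataset.
raw = pd.DataFrame(
    {
        "id": [10, 11],
        "comment_text": ["first comment", "second comment"],
        "label": [0, 1],
    },
    index=[5, 9],
)
clean = preprocess(raw)
```

The same drop/rename/reset pattern applies to the other datasets in this list, with only the column names changing.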
-
Lang: RU
Rows: 5430
Preprocess: Select the `simplified` subset. Drop columns `id` and `text`. Convert column `labels` to tuple. Remove from `labels` the values `0`, `4`, `5`, `6`, `7`, `8`, `10`, `12`, `15`, `18`, `21`, `22`, `23`. Rename column `labels` to `target`. Rename column `ru_text` to `source`. Group emotions and change numbers to words:
- Labels `1`, `13`, `17`, `20` change to label `1`.
- Labels `9`, `16`, `24`, `25` change to label `2`.
- Labels `14`, `19` change to label `3`.
- Labels `2`, `3` change to label `4`.
- Labels `27` change to label `7`.
- Labels `26` change to label `6`.
- Other labels change to label `8`.

Delete duplicates in `target`. Clean column `source`. Reset indexes.
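The label filtering and grouping can be sketched in plain Python. This is an illustration only: the names `REMOVED`, `GROUPS`, `OTHER`, and `regroup` are hypothetical, and the sketch assumes that each row's labels are a tuple of integers, that "delete duplicates in `target`" means dropping repeated group labels within a row, and that a row whose labels were all removed ends up with an empty tuple.

```python
# Raw label values that are removed before grouping.
REMOVED = {0, 4, 5, 6, 7, 8, 10, 12, 15, 18, 21, 22, 23}

# Group label -> raw labels merged into it, as listed in the preprocessing steps.
GROUPS = {
    1: (1, 13, 17, 20),
    2: (9, 16, 24, 25),
    3: (14, 19),
    4: (2, 3),
    7: (27,),
    6: (26,),
}
OTHER = 8  # group for any remaining raw label

# Invert GROUPS into a raw-label -> group-label lookup.
LABEL_TO_GROUP = {
    raw: group for group, raws in GROUPS.items() for raw in raws
}


def regroup(labels):
    """Filter removed labels, map the rest to groups, and drop duplicates."""
    kept = [label for label in labels if label not in REMOVED]
    grouped = [LABEL_TO_GROUP.get(label, OTHER) for label in kept]
    # dict.fromkeys keeps the first occurrence of each group, in order.
    return tuple(dict.fromkeys(grouped))
```

The mapping is applied once to the raw labels, so a raw label `2` becomes group `4` without being confused with group `2`.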
papluca/language-identification
Lang: EN
Rows: 10000
Preprocess: Rename column `labels` to `target`. Rename column `text` to `source`. Map language abbreviation to label classes. Reset indexes.
-
Lang: EN
Rows: 7600
Preprocess: Rename column `label` to `target`. Rename column `text` to `source`. Reset indexes.
-
Lang: EN
Rows: 25000
Preprocess: Select the `test` split. Rename column `labels` to `target`. Rename column `text` to `source`. Reset indexes.
-
Lang: EN
Rows: 2000
Preprocess: Select the `split` subset. Select the `validation` split. Rename column `label` to `target`. Rename column `text` to `source`. Reset indexes.
-
Lang: RU
Rows: 36591
Preprocess: Select the `validation` split. Keep only the `content` and `grade3` columns. Rename column `grade3` to `target`. Rename column `content` to `source`. Delete empty rows in the dataset. Map `target` to class labels. Reset indexes.
blinoff/healthcare_facilities_reviews
Lang: RU
Rows: 70597
Preprocess: Select the `validation` split. Keep only the `content` and `sentiment` columns. Rename column `sentiment` to `target`. Rename column `content` to `source`. Map `target` to class labels.
Note: When this dataset is used with the multiclass model `blanchefort/rubert-base-cased-sentiment-rusentiment`, the `neutral` class must be mapped to the `negative` class at the prediction stage.
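One way to apply this at prediction time is to post-process the predicted label. A sketch, assuming the model emits the label strings `NEUTRAL`, `POSITIVE`, and `NEGATIVE` (an assumption; check the model's `id2label` config) and that the function name `to_binary_sentiment` is hypothetical:

```python
def to_binary_sentiment(predicted_label: str) -> str:
    """Fold the NEUTRAL prediction into NEGATIVE; leave other labels as-is."""
    label = predicted_label.upper()
    return "NEGATIVE" if label == "NEUTRAL" else label


# Example: post-process a batch of predicted labels.
predictions = ["POSITIVE", "NEUTRAL", "NEGATIVE"]
binary_predictions = [to_binary_sentiment(p) for p in predictions]
```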
tatiana-merz/cyrillic_turkic_langs
Lang: RU
Rows: 9000
Preprocess: Select the `validation` split. Rename column `label` to `target`. Rename column `text` to `source`. Map `target` to class labels.
Lang: RU
Rows: 6350
Preprocess: Rename column `toxic` to `target`. Rename column `neutral` to `source`. Delete duplicates in the dataset. Map `target` to class labels. Reset indexes.
Lang: RU
Rows: 163187
Preprocess: Select the `train` split. Rename column `toxic` to `target`. Rename column `text` to `source`.
Lang: RU
Rows: 20900
Preprocess: Rename column `reasons` to `target`. Rename column `toxic_comment` to `source`. Rename the `{"toxic_content":true}` label to `1` and the `{"not_toxic":true}` label to `0`. Remove irrelevant rows in the dataset. Delete duplicates in the dataset. Reset indexes.
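The label conversion above can be sketched as follows. This is a hedged illustration, not the original pipeline: the function names are hypothetical, and it assumes that `reasons` arrives either as a JSON string or as a dict, and that "irrelevant rows" are those whose `reasons` value is neither of the two labels listed.

```python
import json
from typing import Optional, Union


def reasons_to_target(reasons: Union[str, dict]) -> Optional[int]:
    """Return 1 for toxic, 0 for not toxic, None for any other value."""
    if isinstance(reasons, str):
        reasons = json.loads(reasons)
    if reasons == {"toxic_content": True}:
        return 1
    if reasons == {"not_toxic": True}:
        return 0
    return None


def preprocess_toxic_rows(rows):
    """Convert labels, drop irrelevant rows, and remove duplicate rows."""
    seen = set()
    result = []
    for row in rows:
        target = reasons_to_target(row["reasons"])
        if target is None:
            continue  # assumed to be an irrelevant row
        pair = (row["toxic_comment"], target)
        if pair in seen:
            continue  # duplicate of an earlier row
        seen.add(pair)
        result.append({"source": pair[0], "target": pair[1]})
    return result


# Hypothetical rows for illustration.
sample = [
    {"toxic_comment": "a", "reasons": '{"toxic_content":true}'},
    {"toxic_comment": "b", "reasons": {"not_toxic": True}},
    {"toxic_comment": "a", "reasons": {"toxic_content": True}},
    {"toxic_comment": "c", "reasons": {"some_other_reason": True}},
]
cleaned = preprocess_toxic_rows(sample)
```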
Lang: EN
Rows: 6490
Preprocess: Select the `validation` split. Drop columns `id`, `severe_toxic`, `obscene`, `threat`, `insult`, `identity_hate`. Rename column `toxic` to `target`. Rename column `comment_text` to `source`. Reset indexes.
Lang: EN
Rows: 26507
Preprocess: Select the `train` split. Rename column `toxic` to `target`. Rename column `comment` to `source`. Reset indexes.
Supervised Fine-Tuning (SFT) Parameters
Note: For the `XSY/albert-base-v2-imdb-calssification` model, set the SFT parameter `target_modules=["query", "key", "value", "dense"]`.
Metrics
F1-score
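For reference, the F1-score is the harmonic mean of precision and recall. The source does not state which averaging is used for the multiclass datasets, so the sketch below covers only the binary case for a chosen positive class; the function name `f1_score_binary` is hypothetical.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class; returns 0.0 with no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```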