Classification
Models
| Model | Lang |
|---|---|
| | EN |
| | RU |
| | RU |
| | EN |
| | EN |
| | EN |
| | RU |
| | RU |
| | RU |
| | RU |
Datasets
-
Lang: EN
Rows: 31915
Preprocess:
Drop column `id`.
Rename column `label` to `target`.
Rename column `comment_text` to `source`.
Reset indexes.
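The drop, rename, and reset steps for this dataset can be sketched with pandas. This is a minimal illustration, assuming the data is loaded into a `DataFrame`; the two sample rows are hypothetical and only the column names come from the text.

```python
import pandas as pd

# Hypothetical two-row sample standing in for the real dataset.
df = pd.DataFrame(
    {
        "id": [10, 11],
        "label": [0, 1],
        "comment_text": ["have a nice day", "you are awful"],
    }
)

df = (
    df.drop(columns=["id"])  # drop the unused identifier column
    .rename(columns={"label": "target", "comment_text": "source"})
    .reset_index(drop=True)  # renumber rows from 0
)

print(list(df.columns))  # ['target', 'source']
```

The same pattern applies to the other datasets in this list; only the column names in the `rename` mapping differ.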
-
Lang: RU
Rows: 5430
Preprocess:
Select `simplified` subset.
Drop columns `id` and `text`.
Convert column `labels` to tuple.
Remove from `labels` values `0`, `4`, `5`, `6`, `7`, `8`, `10`, `12`, `15`, `18`, `21`, `22`, `23`.
Rename column `labels` to `target`.
Rename column `ru_text` to `source`.
Group emotions and change numbers to words:
Labels `1`, `13`, `17`, `20` change to label `1`.
Labels `9`, `16`, `24`, `25` change to label `2`.
Labels `14`, `19` change to label `3`.
Labels `2`, `3` change to label `4`.
Label `27` changes to label `7`.
Label `26` changes to label `6`.
Other labels change to label `8`.
Delete duplicates in `target`.
Clean column `source`.
Reset indexes.
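The label filtering and grouping for this dataset could look roughly like the sketch below. The id lists are taken from the steps above; the names `GROUPS`, `REMOVED`, `OTHER`, and `regroup` are hypothetical, and reading "Delete duplicates in `target`" as removing repeated ids inside each tuple is an assumption.

```python
# Original label ids grouped into new label ids, as listed in the steps.
GROUPS = {
    1: (1, 13, 17, 20),
    2: (9, 16, 24, 25),
    3: (14, 19),
    4: (2, 3),
    7: (27,),
    6: (26,),
}
# Ids removed before grouping, as listed in the steps.
REMOVED = {0, 4, 5, 6, 7, 8, 10, 12, 15, 18, 21, 22, 23}
OTHER = 8  # every remaining id falls into this label

# Invert GROUPS into a flat lookup: original id -> grouped id.
OLD_TO_NEW = {old: new for new, olds in GROUPS.items() for old in olds}


def regroup(labels):
    """Drop removed ids, map the rest, and de-duplicate while keeping order."""
    kept = [label for label in labels if label not in REMOVED]
    mapped = [OLD_TO_NEW.get(label, OTHER) for label in kept]
    # Assumption: duplicates are removed within each tuple.
    return tuple(dict.fromkeys(mapped))


print(regroup((1, 13, 0)))  # (1,)
```

Rows whose tuple ends up empty after filtering would presumably need to be dropped as well, although the steps do not say so.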
papluca/language-identification
Lang: EN
Rows: 10000
Preprocess:
Rename column `labels` to `target`.
Rename column `text` to `source`.
Map language abbreviation to label classes.
Reset indexes.
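The text does not say how language abbreviations are mapped to label classes. One possible approach, shown here purely as an assumption, is to assign integer ids in alphabetical order; in practice the ids should match the label classes the evaluated model expects. `build_label_map` and the sample values are hypothetical.

```python
def build_label_map(abbreviations):
    """Assign a stable integer id to each distinct language abbreviation."""
    return {abbr: idx for idx, abbr in enumerate(sorted(set(abbreviations)))}


targets = ["en", "ru", "de", "en"]  # hypothetical sample of the `target` column
label_map = build_label_map(targets)
encoded = [label_map[t] for t in targets]

print(label_map)  # {'de': 0, 'en': 1, 'ru': 2}
print(encoded)    # [1, 2, 0, 1]
```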
-
Lang: EN
Rows: 7600
Preprocess:
Rename column `label` to `target`.
Rename column `text` to `source`.
Reset indexes.
-
Lang: EN
Rows: 25000
Preprocess:
Select `test` split.
Rename column `labels` to `target`.
Rename column `text` to `source`.
Reset indexes.
-
Lang: EN
Rows: 2000
Preprocess:
Select `split` subset.
Select `validation` split.
Rename column `label` to `target`.
Rename column `text` to `source`.
Reset indexes.
-
Lang: RU
Rows: 36591
Preprocess:
Select `validation` split.
Leave only `content` and `grade3` columns.
Rename column `grade3` to `target`.
Rename column `content` to `source`.
Delete empty rows in dataset.
Map `target` with class labels.
Reset indexes.
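Deleting empty rows and mapping `target` to class labels might be done as in this sketch. The sample rows, the `CLASS_LABELS` mapping, and the definition of "empty" (a missing or whitespace-only value) are all assumptions, since the text does not specify them.

```python
# Hypothetical rows after the rename steps; real values are not shown in the text.
rows = [
    {"source": "Great service", "target": "Good"},
    {"source": "   ", "target": "Bad"},
    {"source": None, "target": "Good"},
    {"source": "Long queue", "target": None},
    {"source": "Rude staff", "target": "Bad"},
]

# Hypothetical class-label mapping.
CLASS_LABELS = {"Bad": 0, "Good": 1}


def is_empty(value):
    """Treat None and whitespace-only strings as empty."""
    return value is None or (isinstance(value, str) and not value.strip())


cleaned = [
    {"source": row["source"], "target": CLASS_LABELS[row["target"]]}
    for row in rows
    if not is_empty(row["source"]) and not is_empty(row["target"])
]

print(cleaned)
# [{'source': 'Great service', 'target': 1}, {'source': 'Rude staff', 'target': 0}]
```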
blinoff/healthcare_facilities_reviews
Lang: RU
Rows: 70597
Preprocess:
Select `validation` split.
Leave only `content` and `sentiment` columns.
Rename column `sentiment` to `target`.
Rename column `content` to `source`.
Map `target` with class labels.
Note
When combined with the multiclass model `blanchefort/rubert-base-cased-sentiment-rusentiment`,
the neutral class must be mapped to the negative class at the prediction stage.
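Folding the neutral class into the negative class at prediction time can be as small as the sketch below. The label strings `"NEUTRAL"`, `"NEGATIVE"`, and `"POSITIVE"` are assumptions; check the model's own label names before relying on them. `to_binary` is a hypothetical helper.

```python
def to_binary(predicted_label):
    """Fold the neutral class into the negative class; keep other labels as-is."""
    return "NEGATIVE" if predicted_label == "NEUTRAL" else predicted_label


# Hypothetical multiclass predictions.
predictions = ["POSITIVE", "NEUTRAL", "NEGATIVE"]
print([to_binary(p) for p in predictions])
# ['POSITIVE', 'NEGATIVE', 'NEGATIVE']
```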
tatiana-merz/cyrillic_turkic_langs
Lang: RU
Rows: 9000
Preprocess:
Select `validation` split.
Rename column `label` to `target`.
Rename column `text` to `source`.
Map `target` with class labels.
Lang: RU
Rows: 6350
Preprocess:
Rename column `toxic` to `target`.
Rename column `neutral` to `source`.
Delete duplicates in dataset.
Map `target` with class labels.
Reset indexes.
Lang: RU
Rows: 163187
Preprocess:
Select `train` split.
Rename column `toxic` to `target`.
Rename column `text` to `source`.
Lang: RU
Rows: 20900
Preprocess:
Rename column `reasons` to `target`.
Rename column `toxic_comment` to `source`.
Rename `{"toxic_content":true}` label to `1` and `{"not_toxic":true}` label to `0`.
Remove irrelevant rows in dataset.
Delete duplicates in dataset.
Reset indexes.
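A sketch of the label conversion and clean-up for this dataset follows. It assumes that `target` holds a JSON string (or an already parsed dict) and that "irrelevant rows" are rows whose label is neither of the two named ones; both points are assumptions, and the sample rows, including the `spam` key, are hypothetical.

```python
import json

LABELS = {"toxic_content": 1, "not_toxic": 0}


def encode_reasons(raw):
    """Return 1/0 for the two known labels, or None for any other row."""
    reasons = json.loads(raw) if isinstance(raw, str) else raw
    active = [key for key, flag in reasons.items() if flag]
    if len(active) == 1 and active[0] in LABELS:
        return LABELS[active[0]]
    return None


# Hypothetical rows after the rename steps.
rows = [
    {"source": "text a", "target": '{"toxic_content":true}'},
    {"source": "text b", "target": '{"not_toxic":true}'},
    {"source": "text c", "target": '{"spam":true}'},
    {"source": "text a", "target": '{"toxic_content":true}'},
]

seen = set()
cleaned = []
for row in rows:
    label = encode_reasons(row["target"])
    key = (row["source"], label)
    if label is None or key in seen:  # skip irrelevant rows and duplicates
        continue
    seen.add(key)
    cleaned.append({"source": row["source"], "target": label})

print(cleaned)
# [{'source': 'text a', 'target': 1}, {'source': 'text b', 'target': 0}]
```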
Lang: EN
Rows: 6490
Preprocess:
Select `validation` split.
Drop columns `id`, `severe_toxic`, `obscene`, `threat`, `insult`, `identity_hate`.
Rename column `toxic` to `target`.
Rename column `comment_text` to `source`.
Reset indexes.
Lang: EN
Rows: 26507
Preprocess:
Select `train` split.
Rename column `toxic` to `target`.
Rename column `comment` to `source`.
Reset indexes.
Supervised Fine-Tuning (SFT) Parameters
Note
Set `target_modules=["query", "key", "value", "dense"]` as an SFT parameter for the
`XSY/albert-base-v2-imdb-calssification` model.
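One way to keep this per-model setting out of the general configuration is a small override table, sketched below. The text does not say which library consumes `target_modules`; PEFT-style adapter configs such as `peft.LoraConfig` accept a parameter with this name, but that link is an assumption. `TARGET_MODULE_OVERRIDES` and `sft_overrides` are hypothetical names.

```python
# Hypothetical per-model overrides; the module list comes from the note.
TARGET_MODULE_OVERRIDES = {
    "XSY/albert-base-v2-imdb-calssification": ["query", "key", "value", "dense"],
}


def sft_overrides(model_name):
    """Return extra SFT keyword arguments for a model, or an empty dict."""
    modules = TARGET_MODULE_OVERRIDES.get(model_name)
    return {"target_modules": list(modules)} if modules else {}


print(sft_overrides("XSY/albert-base-v2-imdb-calssification"))
# {'target_modules': ['query', 'key', 'value', 'dense']}
```

The returned dict can then be merged into whatever keyword arguments the SFT setup already passes to its adapter configuration.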
Metrics
F1-score
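For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it for a single positive class; the text does not state which averaging scheme (binary, macro, micro, weighted) is used for the multiclass datasets, so treating one class as positive is an assumption.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for one positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative.
score = f1_score([1, 0, 1, 1], [1, 1, 0, 1])
print(round(score, 3))  # 0.667
```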