Classification
Models
| Model | Lang |
|---|---|
|  | EN |
|  | RU |
|  | RU |
|  | EN |
|  | EN |
|  | EN |
|  | RU |
|  | RU |
|  | RU |
|  | RU |
Datasets
-
Lang: EN
Rows: 31915
Preprocess:
  - Select the `validation` split.
  - Drop column `id`.
  - Rename column `label` to `target`.
  - Rename column `comment_text` to `source`.
  - Reset the index.
Note
When obtaining this dataset, pass the following parameter to the `load_dataset` call: `revision="refs/convert/parquet"`.
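The preprocessing steps listed for this dataset can be sketched with pandas as follows. This is a minimal illustration on a toy frame, assuming the split has already been loaded into a `DataFrame`; the helper name `preprocess` and the sample rows are hypothetical and not part of the original pipeline.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop `id`, rename `label` and `comment_text`, and reset the index."""
    return (
        df.drop(columns=["id"])
        .rename(columns={"label": "target", "comment_text": "source"})
        .reset_index(drop=True)
    )


# Toy rows standing in for the validation split (hypothetical data).
raw = pd.DataFrame(
    {
        "id": ["a1", "b2"],
        "comment_text": ["first comment", "second comment"],
        "label": [0, 1],
    },
    index=[10, 20],
)
clean = preprocess(raw)
```

The same chain of `drop`, `rename`, and `reset_index` applies to the other datasets in this section, with the column names changed accordingly.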
-
Lang: RU
Rows: 5430
Preprocess:
  - Select the `simplified` subset.
  - Select the `validation` split.
  - Convert column `labels` to tuple.
  - Drop columns `id` and `text`.
  - Remove from `labels` the values `0`, `4`, `5`, `6`, `7`, `8`, `10`, `12`, `15`, `18`, `21`, `22`, `23`.
  - Rename column `labels` to `target`.
  - Rename column `ru_text` to `source`.
  - Group emotions (leave only one label per row):
    - Labels `1`, `13`, `17`, `20` change to label `1`.
    - Labels `9`, `16`, `24`, `25` change to label `2`.
    - Labels `14`, `19` change to label `3`.
    - Labels `2`, `3` change to label `4`.
    - Label `27` changes to label `7`.
    - Label `26` changes to label `6`.
    - Other labels change to label `8`.
  - Drop label `8` from `target`.
  - Map `target` labels to sequential numbers: `1` to `0` (joy), `2` to `1` (sadness), `3` to `2` (fear), `4` to `3` (anger), `6` to `4` (neutral), `7` to `5` (other).
  - Clean column `source`.
  - Reset the index.
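The label removal, grouping, and renumbering described for this dataset can be sketched in plain Python. The steps do not say how a row with several remaining labels is reduced to one, so this sketch assumes the first remaining label is used; it also assumes that a row left without labels is dropped. The names `REMOVED`, `GROUPS`, `SEQUENTIAL`, and `remap` are hypothetical.

```python
from typing import Optional, Sequence

# Label ids removed before grouping, as listed in the steps.
REMOVED = {0, 4, 5, 6, 7, 8, 10, 12, 15, 18, 21, 22, 23}

# Original label id -> grouped label id.
GROUPS = {
    **dict.fromkeys((1, 13, 17, 20), 1),
    **dict.fromkeys((9, 16, 24, 25), 2),
    **dict.fromkeys((14, 19), 3),
    **dict.fromkeys((2, 3), 4),
    27: 7,
    26: 6,
}
OTHER = 8  # every other label falls into this group, which is dropped

# Grouped label id -> sequential class id.
SEQUENTIAL = {1: 0, 2: 1, 3: 2, 4: 3, 6: 4, 7: 5}


def remap(labels: Sequence[int]) -> Optional[int]:
    """Return the sequential class id for a row, or None if the row is dropped."""
    kept = [label for label in labels if label not in REMOVED]
    if not kept:
        return None  # assumption: rows without remaining labels are dropped
    # Assumption: the first remaining label represents the row.
    grouped = GROUPS.get(kept[0], OTHER)
    # Group 8 has no sequential id, so .get() returns None for it.
    return SEQUENTIAL.get(grouped)
```

For example, `remap((0, 14))` discards `0`, groups `14` into label `3`, and returns class `2` (fear).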
papluca/language-identification
Lang: EN
Rows: 10000
Preprocess:
  - Select the `validation` split.
  - Rename column `labels` to `target`.
  - Rename column `text` to `source`.
  - Map language abbreviations to label classes.
  - Reset the index.
-
Lang: EN
Rows: 7600
Preprocess:
  - Select the `test` split.
  - Rename column `label` to `target`.
  - Rename column `text` to `source`.
  - Reset the index.
-
Lang: EN
Rows: 25000
Preprocess:
  - Select the `test` split.
  - Rename column `label` to `target`.
  - Rename column `text` to `source`.
  - Reset the index.
Note
When this dataset is used with the XSY/albert-base-v2-imdb-calssification model, set the parameter `max_length=512`.
-
Lang: EN
Rows: 2000
Preprocess:
  - Select the `split` subset.
  - Select the `validation` split.
  - Rename column `label` to `target`.
  - Rename column `text` to `source`.
  - Reset the index.
-
Lang: RU
Rows: 36591
Preprocess:
  - Select the `train` split.
  - Keep only the `content` and `grade3` columns.
  - Rename column `grade3` to `target`.
  - Rename column `content` to `source`.
  - Delete empty rows in the dataset.
  - Map `target` to class labels.
  - Reset the index.
blinoff/healthcare_facilities_reviews
Lang: RU
Rows: 70597
Preprocess:
  - Select the `validation` split.
  - Keep only the `content` and `sentiment` columns.
  - Rename column `sentiment` to `target`.
  - Rename column `content` to `source`.
  - Map `target` to class labels.
Note
When obtaining this dataset, pass the following parameter to the `load_dataset` call: `revision="refs/convert/parquet"`.
Note
When this dataset is used with the multiclass model blanchefort/rubert-base-cased-sentiment-rusentiment,
predictions of the neutral class must be mapped to the negative class at the prediction stage.
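Mapping neutral predictions to the negative class can be sketched as a small post-processing step. This is only an illustration: the function name `collapse_neutral` and the string label names are hypothetical, and the actual class ids depend on the model configuration and on how `target` was mapped.

```python
from typing import Iterable, List


def collapse_neutral(
    predictions: Iterable[str],
    neutral: str = "neutral",
    negative: str = "negative",
) -> List[str]:
    """Replace every neutral prediction with the negative class."""
    return [negative if label == neutral else label for label in predictions]


# Hypothetical model output for four reviews.
predicted = ["positive", "neutral", "negative", "neutral"]
binary = collapse_neutral(predicted)
```

After this step the predictions contain only the two classes present in the dataset, so they can be compared with `target` directly.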
tatiana-merz/cyrillic_turkic_langs
Lang: RU
Rows: 9000
Preprocess:
  - Select the `validation` split.
  - Rename column `label` to `target`.
  - Rename column `text` to `source`.
  - Map `target` to class labels.
Lang: RU
Rows: 6350
Preprocess:
  - Select the `train` split.
  - Rename column `toxic` to `target`.
  - Rename column `neutral` to `source`.
  - Delete duplicates in the dataset.
  - Map `target` to class labels.
  - Reset the index.
Lang: RU
Rows: 163187
Preprocess:
  - Select the `train` split.
  - Rename column `toxic` to `target`.
  - Rename column `text` to `source`.
Lang: RU
Rows: 20900
Preprocess:
  - Select the `train` split.
  - Rename column `reasons` to `target`.
  - Rename column `toxic_comment` to `source`.
  - Map the `{"toxic_content":true}` label to `1` and the `{"not_toxic":true}` label to `0`.
  - Remove irrelevant rows in the dataset.
  - Delete duplicates in the dataset.
  - Reset the index.
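The relabeling, filtering, and deduplication steps for this dataset can be sketched in plain Python. The sketch rests on several assumptions that the steps do not confirm: `reasons` is stored either as a JSON string or as a dict, "irrelevant" rows are those whose `reasons` match neither of the two listed values, and duplicates are rows with the same text and label. The names `reasons_to_label` and `clean_rows` are hypothetical.

```python
import json
from typing import Dict, List, Optional, Union

Reasons = Union[str, Dict[str, bool]]


def reasons_to_label(reasons: Reasons) -> Optional[int]:
    """Map a `reasons` value to 1 (toxic), 0 (not toxic), or None (irrelevant)."""
    parsed = json.loads(reasons) if isinstance(reasons, str) else reasons
    if parsed == {"toxic_content": True}:
        return 1
    if parsed == {"not_toxic": True}:
        return 0
    return None


def clean_rows(rows: List[dict]) -> List[dict]:
    """Relabel rows, skip irrelevant ones, and drop duplicates (order is kept)."""
    seen = set()
    result = []
    for row in rows:
        target = reasons_to_label(row["reasons"])
        if target is None:
            continue  # assumption: neither listed value means "irrelevant"
        key = (row["toxic_comment"], target)
        if key not in seen:
            seen.add(key)
            result.append({"source": row["toxic_comment"], "target": target})
    return result
```

Because the result is a new list, its positions already run from zero, which corresponds to resetting the index.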
Lang: EN
Rows: 6490
Preprocess:
  - Select the `validation` split.
  - Drop columns `id`, `severe_toxic`, `obscene`, `threat`, `insult`, `identity_hate`.
  - Rename column `toxic` to `target`.
  - Rename column `comment_text` to `source`.
  - Reset the index.
Lang: EN
Rows: 26507
Preprocess:
  - Select the `train` split.
  - Rename column `toxic` to `target`.
  - Rename column `comment` to `source`.
  - Reset the index.
Supervised Fine-Tuning (SFT) Parameters
Note
- Set `fine_tuning_steps=150` and `target_modules=["key"]` as SFT parameters for the tatiana-merz/turkic-cyrillic-classifier model.
- Set `target_modules=["query", "key", "value", "dense"]` as the SFT parameter for the XSY/albert-base-v2-imdb-calssification model.
- Set `problem_type="single_label_classification"` and `num_labels=6` when initializing the cointegrated/rubert-tiny2-cedr-emotion-detection model instance. Set `target_modules=["query", "key", "value", "dense"]`, `rank=16`, and `alpha=24` as its SFT parameters.
- Set `problem_type="single_label_classification"` and `num_labels=5` when initializing the cointegrated/rubert-tiny-toxicity model instance for the OxAISH-AL-LLM/wiki_toxic dataset.
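One way to keep these per-model settings in one place is a small lookup table. The sketch below is only an illustration: the `SFTOverrides` container, the `SFT_OVERRIDES` table, and the `overrides_for` helper are hypothetical names, and `None` stands for a value that the note does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SFTOverrides:
    """Per-model SFT overrides; None means the note gives no value."""

    target_modules: Optional[List[str]] = None
    fine_tuning_steps: Optional[int] = None
    rank: Optional[int] = None
    alpha: Optional[int] = None


ATTENTION_AND_DENSE = ["query", "key", "value", "dense"]

SFT_OVERRIDES = {
    "tatiana-merz/turkic-cyrillic-classifier": SFTOverrides(
        target_modules=["key"], fine_tuning_steps=150
    ),
    "XSY/albert-base-v2-imdb-calssification": SFTOverrides(
        target_modules=ATTENTION_AND_DENSE
    ),
    "cointegrated/rubert-tiny2-cedr-emotion-detection": SFTOverrides(
        target_modules=ATTENTION_AND_DENSE, rank=16, alpha=24
    ),
}


def overrides_for(model_name: str) -> SFTOverrides:
    """Return the overrides for a model, or an empty set if none are listed."""
    return SFT_OVERRIDES.get(model_name, SFTOverrides())
```

A model that is not listed gets an empty `SFTOverrides()`, so whatever defaults the fine-tuning code defines stay in effect.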
Metrics
F1-score
Important
Use average = "micro".
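To make the averaging mode concrete, here is a minimal pure-Python sketch of the micro-averaged F1-score: true positives, false positives, and false negatives are pooled over all classes before the score is computed. The function name `micro_f1` is hypothetical; in practice a library call such as scikit-learn's `f1_score(y_true, y_pred, average="micro")` would normally be used instead.

```python
from typing import Hashable, Sequence


def micro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = 0
    for cls in set(y_true) | set(y_pred):
        for true, pred in zip(y_true, y_pred):
            tp += true == cls and pred == cls
            fp += true != cls and pred == cls
            fn += true == cls and pred != cls
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


score = micro_f1([0, 1, 2, 2], [0, 2, 2, 2])
```

With one label per row, each wrong prediction counts once as a false positive and once as a false negative, so the micro-averaged F1-score equals accuracy.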