Classification

Models

Model	Lang
cointegrated/rubert-tiny-toxicity	EN
cointegrated/rubert-tiny2-cedr-emotion-detection	RU
papluca/xlm-roberta-base-language-detection	RU
fabriceyhc/bert-base-uncased-ag_news	EN
XSY/albert-base-v2-imdb-calssification	EN
IlyaGusev/rubertconv_toxic_clf	EN
aiknowyou/it-emotion-analyzer	RU
blanchefort/rubert-base-cased-sentiment-rusentiment	RU
tatiana-merz/turkic-cyrillic-classifier	RU
s-nlp/russian_toxicity_classifier	RU

Datasets

OxAISH-AL-LLM/wiki_toxic
1. Lang: EN
2. Rows: 31915
3. Preprocess:
  1. Select validation split.
  2. Drop column id.
  3. Rename column label to target.
  4. Rename column comment_text to source.
  5. Reset indexes.

Note

When obtaining this dataset, pass the following parameter to load_dataset:

revision="refs/convert/parquet"
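The steps above can be sketched with pandas as follows. This is a minimal sketch, not the reference implementation: the helper names preprocess_wiki_toxic and load_wiki_toxic are hypothetical, and converting the split to a DataFrame via to_pandas() is an assumption about how the data is handled downstream.

```python
import pandas as pd


def preprocess_wiki_toxic(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply the listed steps to the validation split (hypothetical helper)."""
    return (
        frame.drop(columns=["id"])
        .rename(columns={"label": "target", "comment_text": "source"})
        .reset_index(drop=True)
    )


def load_wiki_toxic() -> pd.DataFrame:
    """Download the validation split; needs network access and the datasets package."""
    from datasets import load_dataset

    dataset = load_dataset(
        "OxAISH-AL-LLM/wiki_toxic",
        split="validation",
        revision="refs/convert/parquet",  # parameter required by the note above
    )
    return preprocess_wiki_toxic(dataset.to_pandas())
```

Keeping the download separate from the column handling lets the preprocessing be checked on a small in-memory DataFrame.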

seara/ru_go_emotions
1. Lang: RU
2. Rows: 5430
3. Preprocess:
  1. Select simplified subset.
  2. Select validation split.
  3. Convert column labels to tuple.
  4. Drop columns id and text.
  5. Remove from labels values 0, 4, 5, 6, 7, 8, 10, 12, 15, 18, 21, 22, 23.
  6. Rename column labels to target.
  7. Rename column ru_text to source.
  8. Group emotions (leave only one label per row):
    1. Change labels 1, 13, 17, 20 to label 1.
    2. Change labels 9, 16, 24, 25 to label 2.
    3. Change labels 14, 19 to label 3.
    4. Change labels 2, 3 to label 4.
    5. Change label 27 to label 7.
    6. Change label 26 to label 6.
    7. Change all other labels to label 8.
  9. Drop label 8 from target.
  10. Map target labels to sequential numbers: 1 to 0 (joy), 2 to 1 (sadness), 3 to 2 (fear), 4 to 3 (anger), 6 to 4 (neutral), 7 to 5 (other).
  11. Clean column source.
  12. Reset indexes.
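The label handling in steps 5, 8, 9 and 10 can be sketched as one function over a row's tuple of labels. Several points here are assumptions rather than facts from the list above: the function name to_target is hypothetical, a row that still has several labels after filtering keeps only the first one, and a row left with no labels is treated as label 8 and dropped.

```python
# Raw labels removed in step 5.
REMOVED = {0, 4, 5, 6, 7, 8, 10, 12, 15, 18, 21, 22, 23}

# Step 8: group id -> raw labels that collapse into it.
GROUPS = {
    1: (1, 13, 17, 20),
    2: (9, 16, 24, 25),
    3: (14, 19),
    4: (2, 3),
    7: (27,),
    6: (26,),
}
# Flattened to raw -> group so each raw label is remapped exactly once
# (raw 2 goes to group 4 and is never re-read as group 2).
RAW_TO_GROUP = {raw: group for group, raws in GROUPS.items() for raw in raws}

# Step 10: group id -> sequential class id.
GROUP_TO_CLASS = {1: 0, 2: 1, 3: 2, 4: 3, 6: 4, 7: 5}


def to_target(labels):
    """Return the final class id for a row, or None if the row is dropped."""
    kept = [label for label in labels if label not in REMOVED]
    # Assumption: keep the first remaining label; anything unmapped becomes 8.
    group = RAW_TO_GROUP.get(kept[0], 8) if kept else 8
    # Step 9: label 8 has no class id, so the row is dropped (None).
    return GROUP_TO_CLASS.get(group)
```

Rows for which to_target returns None would be removed before the final index reset.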
papluca/language-identification
1. Lang: EN
2. Rows: 10000
3. Preprocess:
  1. Select validation split.
  2. Rename column labels to target.
  3. Rename column text to source.
  4. Map language abbreviation to label classes.
  5. Reset indexes.
ag_news
1. Lang: EN
2. Rows: 7600
3. Preprocess:
  1. Select test split.
  2. Rename column label to target.
  3. Rename column text to source.
  4. Reset indexes.
imdb
1. Lang: EN
2. Rows: 25000
3. Preprocess:
  1. Select test split.
  2. Rename column label to target.
  3. Rename column text to source.
  4. Reset indexes.

Note

When using this dataset with the XSY/albert-base-v2-imdb-calssification model, set max_length=512.
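A minimal sketch of where this setting would typically go, assuming the texts are tokenized with a Hugging Face tokenizer. The helper name tokenize_batch is hypothetical, and enabling truncation and padding alongside max_length is an assumption, not something the note states.

```python
def tokenize_batch(tokenizer, texts, max_length=512):
    """Tokenize texts, cutting every sequence at max_length tokens."""
    return tokenizer(
        list(texts),
        max_length=max_length,
        padding=True,
        truncation=True,  # assumption: longer reviews are truncated, not rejected
        return_tensors="pt",
    )


# Usage (needs network access and the transformers package):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("XSY/albert-base-v2-imdb-calssification")
# batch = tokenize_batch(tokenizer, ["A long movie review ..."])
```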

dair-ai/emotion
1. Lang: EN
2. Rows: 2000
3. Preprocess:
  1. Select split subset.
  2. Select validation split.
  3. Rename column label to target.
  4. Rename column text to source.
  5. Reset indexes.
blinoff/kinopoisk
1. Lang: RU
2. Rows: 36591
3. Preprocess:
  1. Select train split.
  2. Keep only the content and grade3 columns.
  3. Rename column grade3 to target.
  4. Rename column content to source.
  5. Delete empty rows in dataset.
  6. Map target with class labels.
  7. Reset indexes.
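The steps above can be sketched with pandas as follows. The function name preprocess_kinopoisk and the class_ids argument are hypothetical: the list does not give the label-to-id mapping, so it is passed in by the caller. The sketch also assumes that "empty rows" means rows with missing values.

```python
import pandas as pd


def preprocess_kinopoisk(frame: pd.DataFrame, class_ids: dict) -> pd.DataFrame:
    """Keep the two needed columns, rename, drop empty rows, map labels."""
    frame = frame[["content", "grade3"]].rename(
        columns={"grade3": "target", "content": "source"}
    )
    frame = frame.dropna()  # assumption: empty rows are rows with missing values
    frame = frame.assign(target=frame["target"].map(class_ids))
    return frame.reset_index(drop=True)
```

The mapping passed as class_ids should follow the label order of the model being evaluated.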
blinoff/healthcare_facilities_reviews
1. Lang: RU
2. Rows: 70597
3. Preprocess:
  1. Select validation split.
  2. Keep only the content and sentiment columns.
  3. Rename column sentiment to target.
  4. Rename column content to source.
  5. Map target with class labels.

Note

When obtaining this dataset, pass the following parameter to load_dataset:

revision="refs/convert/parquet"

Note

When using this dataset with the multiclass model blanchefort/rubert-base-cased-sentiment-rusentiment, map the neutral class to the negative class at the prediction stage.
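A minimal sketch of such a remapping over a list of predicted class ids. The function name merge_class is hypothetical, and the ids are deliberately left as arguments: the ids of the neutral and negative classes should be read from the model configuration (for example, its id2label mapping), not assumed.

```python
def merge_class(predictions, source_id, target_id):
    """Replace every prediction equal to source_id with target_id."""
    return [target_id if label == source_id else label for label in predictions]


# Usage with placeholder ids (check them against the model's id2label):
# predictions = merge_class(predictions, source_id=neutral_id, target_id=negative_id)
```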

tatiana-merz/cyrillic_turkic_langs
1. Lang: RU
2. Rows: 9000
3. Preprocess:
  1. Select validation split.
  2. Rename column label to target.
  3. Rename column text to source.
  4. Map target with class labels.
s-nlp/ru_paradetox_toxicity
1. Lang: RU
2. Rows: 6350
3. Preprocess:
  1. Select train split.
  2. Rename column toxic to target.
  3. Rename column neutral to source.
  4. Delete duplicates in dataset.
  5. Map target with class labels.
  6. Reset indexes.

d0rj/rudetoxifier_data
1. Lang: RU
2. Rows: 163187
3. Preprocess:
  1. Select train split.
  2. Rename column toxic to target.
  3. Rename column text to source.

s-nlp/ru_non_detoxified
1. Lang: RU
2. Rows: 20900
3. Preprocess:
  1. Select train split.
  2. Rename column reasons to target.
  3. Rename column toxic_comment to source.
  4. Rename {"toxic_content":true} label to 1 and {"not_toxic":true} label to 0.
  5. Remove irrelevant rows in dataset.
  6. Delete duplicates in dataset.
  7. Reset indexes.
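The s-nlp/ru_non_detoxified steps can be sketched with pandas as follows. Two assumptions are made that the list does not confirm: the reasons column holds the labels as the exact strings shown above, and "irrelevant rows" are rows whose label is neither of those two values. The function name preprocess_non_detoxified is hypothetical.

```python
import pandas as pd

# Assumption: labels are stored as these exact strings.
LABEL_TO_CLASS = {'{"toxic_content":true}': 1, '{"not_toxic":true}': 0}


def preprocess_non_detoxified(frame: pd.DataFrame) -> pd.DataFrame:
    """Rename columns, map labels to 0/1, drop unmapped and duplicate rows."""
    frame = frame.rename(columns={"reasons": "target", "toxic_comment": "source"})
    frame = frame.assign(target=frame["target"].map(LABEL_TO_CLASS))
    # Rows with any other label become NaN after map and are removed here.
    frame = frame.dropna(subset=["target"]).drop_duplicates()
    frame = frame.astype({"target": int})
    return frame.reset_index(drop=True)
```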

Arsive/toxicity_classification_jigsaw
1. Lang: EN
2. Rows: 6490
3. Preprocess:
  1. Select validation split.
  2. Drop columns id, severe_toxic, obscene, threat, insult, identity_hate.
  3. Rename column toxic to target.
  4. Rename column comment_text to source.
  5. Reset indexes.

s-nlp/en_paradetox_toxicity
1. Lang: EN
2. Rows: 26507
3. Preprocess:
  1. Select train split.
  2. Rename column toxic to target.
  3. Rename column comment to source.
  4. Reset indexes.

Supervised Fine-Tuning (SFT) Parameters

Note

For the tatiana-merz/turkic-cyrillic-classifier model, set the SFT parameters fine_tuning_steps=150 and target_modules=["key"].
For the XSY/albert-base-v2-imdb-calssification model, set the SFT parameter target_modules=["query", "key", "value", "dense"].
For the cointegrated/rubert-tiny2-cedr-emotion-detection model, set problem_type="single_label_classification" and num_labels=6 when initializing the model instance, and set the SFT parameters target_modules=["query", "key", "value", "dense"], rank=16, and alpha=24.
For the cointegrated/rubert-tiny-toxicity model used with the OxAISH-AL-LLM/wiki_toxic dataset, set problem_type="single_label_classification" and num_labels=5 when initializing the model instance.
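One way to keep these per-model settings in one place is sketched below. The SFTOverrides container, the OVERRIDES table and the lora_kwargs helper are all hypothetical names. Treating rank and alpha as the r and lora_alpha arguments of peft.LoraConfig is an assumption about the fine-tuning setup, which this section does not spell out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SFTOverrides:
    """Per-model deviations from the default SFT setup; None keeps the default."""
    fine_tuning_steps: Optional[int] = None
    target_modules: Optional[Tuple[str, ...]] = None
    rank: Optional[int] = None
    alpha: Optional[int] = None


ATTENTION_AND_DENSE = ("query", "key", "value", "dense")

OVERRIDES = {
    "tatiana-merz/turkic-cyrillic-classifier": SFTOverrides(
        fine_tuning_steps=150, target_modules=("key",)
    ),
    "XSY/albert-base-v2-imdb-calssification": SFTOverrides(
        target_modules=ATTENTION_AND_DENSE
    ),
    "cointegrated/rubert-tiny2-cedr-emotion-detection": SFTOverrides(
        target_modules=ATTENTION_AND_DENSE, rank=16, alpha=24
    ),
}


def lora_kwargs(model_name: str) -> dict:
    """Build keyword arguments for peft.LoraConfig from the overrides (assumed mapping)."""
    overrides = OVERRIDES.get(model_name, SFTOverrides())
    kwargs = {}
    if overrides.rank is not None:
        kwargs["r"] = overrides.rank
    if overrides.alpha is not None:
        kwargs["lora_alpha"] = overrides.alpha
    if overrides.target_modules is not None:
        kwargs["target_modules"] = list(overrides.target_modules)
    return kwargs


# Usage (needs the peft package):
# from peft import LoraConfig
# config = LoraConfig(**lora_kwargs("cointegrated/rubert-tiny2-cedr-emotion-detection"))
```

The problem_type and num_labels settings are not part of this table because they apply when the model instance is created, not when the adapter is configured.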

Metrics

F1-score

Important

Use average="micro".
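For reference, micro-averaging pools true positives, false positives and false negatives over all classes before computing F1. The pure-Python sketch below illustrates the arithmetic for single-label predictions; the function name micro_f1 is hypothetical, and in practice a library implementation such as sklearn.metrics.f1_score with average="micro" would normally be used instead.

```python
def micro_f1(targets, predictions):
    """Micro-averaged F1 for single-label classification."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    true_pos = false_pos = false_neg = 0
    for target, prediction in zip(targets, predictions):
        if target == prediction:
            true_pos += 1
        else:
            # A miss is a false positive for the predicted class
            # and a false negative for the true class.
            false_pos += 1
            false_neg += 1
    denominator = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denominator if denominator else 0.0
```

With exactly one label per row, this value coincides with plain accuracy.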