Classification

Models

Model	Lang
cointegrated/rubert-tiny-toxicity	EN
cointegrated/rubert-tiny2-cedr-emotion-detection	RU
papluca/xlm-roberta-base-language-detection	RU
fabriceyhc/bert-base-uncased-ag_news	EN
XSY/albert-base-v2-imdb-calssification	EN
IlyaGusev/rubertconv_toxic_clf	EN
aiknowyou/it-emotion-analyzer	RU
blanchefort/rubert-base-cased-sentiment-rusentiment	RU
tatiana-merz/turkic-cyrillic-classifier	RU
s-nlp/russian_toxicity_classifier	RU

Datasets

OxAISH-AL-LLM/wiki_toxic
1. Lang: EN
2. Rows: 31915
3. Preprocess:
  1. Drop column id.
  2. Rename column label to target.
  3. Rename column comment_text to source.
  4. Reset indexes.
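The four steps above map naturally onto DataFrame operations. The following is a minimal sketch, assuming the split has already been loaded into a pandas DataFrame. The source does not name its tooling, so pandas is an assumption, and the toy rows and the function name `preprocess_wiki_toxic` are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the loaded split (illustrative rows, not real data).
raw = pd.DataFrame(
    {
        "id": ["a1", "b2", "c3"],
        "comment_text": ["thanks for the edit", "you are an idiot", "see talk page"],
        "label": [0, 1, 0],
    }
)


def preprocess_wiki_toxic(df: pd.DataFrame) -> pd.DataFrame:
    """Drop id, rename label/comment_text, and reset the index."""
    return (
        df.drop(columns=["id"])
        .rename(columns={"label": "target", "comment_text": "source"})
        .reset_index(drop=True)
    )


clean = preprocess_wiki_toxic(raw)
print(clean)
```

The same rename-then-reset pattern would apply to the other datasets in this section; only the original column names differ.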
seara/ru_go_emotions
1. Lang: RU
2. Rows: 5430
3. Preprocess:
  1. Select simplified subset.
  2. Drop columns id and text.
  3. Convert column labels to tuple.
  4. Remove values 0, 4, 5, 6, 7, 8, 10, 12, 15, 18, 21, 22, 23 from labels.
  5. Rename column labels to target.
  6. Rename column ru_text to source.
  7. Group emotions and change numbers to words:
    1. Labels 1, 13, 17, 20 change to label 1.
    2. Labels 9, 16, 24, 25 change to label 2.
    3. Labels 14, 19 change to label 3.
    4. Labels 2, 3 change to label 4.
    5. Label 27 changes to label 7.
    6. Label 26 changes to label 6.
    7. Other labels to label 8.
  8. Delete duplicates in target.
  9. Clean column source.
  10. Reset indexes.
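The label filtering and grouping steps above can be sketched in plain Python. This is one possible reading of the list, not the authors' implementation: it assumes the listed values are removed from each row's label tuple and that "delete duplicates in target" means deduplicating labels within a row. The names `REMOVED`, `GROUPS`, `OTHER`, and `regroup` are hypothetical. The final number-to-word naming is not given in the source, so it is left out.

```python
# Original label ids that are dropped from every row.
REMOVED = {0, 4, 5, 6, 7, 8, 10, 12, 15, 18, 21, 22, 23}

# New label -> original label ids, exactly as listed in the steps above.
GROUPS = {
    1: (1, 13, 17, 20),
    2: (9, 16, 24, 25),
    3: (14, 19),
    4: (2, 3),
    7: (27,),
    6: (26,),
}
OTHER = 8  # every remaining original label falls back to this group

# Invert the table so each original id points at its new label.
OLD_TO_NEW = {old: new for new, olds in GROUPS.items() for old in olds}


def regroup(labels):
    """Filter, remap, and deduplicate one row's labels (order preserved)."""
    kept = (label for label in labels if label not in REMOVED)
    mapped = (OLD_TO_NEW.get(label, OTHER) for label in kept)
    return tuple(dict.fromkeys(mapped))


print(regroup([13, 20, 4]))
print(regroup([11, 2, 3]))
```

Rows whose labels are all removed come back as an empty tuple; whether such rows are dropped afterwards is not stated in the source.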
papluca/language-identification
1. Lang: EN
2. Rows: 10000
3. Preprocess:
  1. Rename column labels to target.
  2. Rename column text to source.
  3. Map language abbreviation to label classes.
  4. Reset indexes.
ag_news
1. Lang: EN
2. Rows: 7600
3. Preprocess:
  1. Rename column label to target.
  2. Rename column text to source.
  3. Reset indexes.
imdb
1. Lang: EN
2. Rows: 25000
3. Preprocess:
  1. Select test split.
  2. Rename column labels to target.
  3. Rename column text to source.
  4. Reset indexes.
dair-ai/emotion
1. Lang: EN
2. Rows: 2000
3. Preprocess:
  1. Select split subset.
  2. Select validation split.
  3. Rename column label to target.
  4. Rename column text to source.
  5. Reset indexes.
blinoff/kinopoisk
1. Lang: RU
2. Rows: 36591
3. Preprocess:
  1. Select validation split.
  2. Leave only content and grade3 columns.
  3. Rename column grade3 to target.
  4. Rename column content to source.
  5. Delete empty rows in dataset.
  6. Map target with class labels.
  7. Reset indexes.
blinoff/healthcare_facilities_reviews
1. Lang: RU
2. Rows: 70597
3. Preprocess:
  1. Select validation split.
  2. Leave only content and sentiment columns.
  3. Rename column sentiment to target.
  4. Rename column content to source.
  5. Map target with class labels.

Note

When this dataset is paired with the multiclass model blanchefort/rubert-base-cased-sentiment-rusentiment, map the model's neutral class to the negative class at the prediction stage.
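A minimal sketch of that prediction-stage remapping follows. It assumes the model's predictions have already been decoded to label strings such as "NEUTRAL", "POSITIVE", and "NEGATIVE"; the exact strings and the helper name `collapse_neutral` are assumptions, not taken from the source.

```python
def collapse_neutral(predicted_label: str) -> str:
    """Fold the neutral class into negative; leave other classes unchanged."""
    label = predicted_label.lower()
    return "negative" if label == "neutral" else label


# Hypothetical decoded predictions from the three-class model.
predictions = ["NEUTRAL", "POSITIVE", "NEGATIVE", "neutral"]
binary_predictions = [collapse_neutral(p) for p in predictions]
print(binary_predictions)
```

Applying the mapping to predictions rather than to the dataset keeps the binary targets untouched.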

tatiana-merz/cyrillic_turkic_langs
1. Lang: RU
2. Rows: 9000
3. Preprocess:
  1. Select validation split.
  2. Rename column label to target.
  3. Rename column text to source.
  4. Map target with class labels.
s-nlp/ru_paradetox_toxicity
1. Lang: RU
2. Rows: 6350
3. Preprocess:
  1. Rename column toxic to target.
  2. Rename column neutral to source.
  3. Delete duplicates in dataset.
  4. Map target with class labels.
  5. Reset indexes.

d0rj/rudetoxifier_data
1. Lang: RU
2. Rows: 163187
3. Preprocess:
  1. Select train split.
  2. Rename column toxic to target.
  3. Rename column text to source.

s-nlp/ru_non_detoxified
1. Lang: RU
2. Rows: 20900
3. Preprocess:
  1. Rename column reasons to target.
  2. Rename column toxic_comment to source.
  3. Rename {"toxic_content":true} label to 1 and {"not_toxic":true} label to 0.
  4. Remove irrelevant rows in dataset.
  5. Delete duplicates in dataset.
  6. Reset indexes.
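The label conversion for this dataset can be sketched as follows. This is a hedged illustration under two assumptions the source does not confirm: the reasons column holds JSON strings, and "irrelevant rows" means rows whose reason is neither toxic_content nor not_toxic. The toy rows, `REASON_TO_LABEL`, `reason_to_label`, and `preprocess_non_detoxified` are all hypothetical.

```python
import json

import pandas as pd

# Toy stand-in for the raw data (illustrative rows only).
raw = pd.DataFrame(
    {
        "toxic_comment": ["text a", "text b", "text b", "text c"],
        "reasons": [
            '{"toxic_content":true}',
            '{"not_toxic":true}',
            '{"not_toxic":true}',
            '{"unclear":true}',
        ],
    }
)

REASON_TO_LABEL = {"toxic_content": 1, "not_toxic": 0}


def reason_to_label(reason: str):
    """Return 1 or 0 for the two known reasons, None for anything else."""
    active = [key for key, value in json.loads(reason).items() if value]
    if len(active) == 1 and active[0] in REASON_TO_LABEL:
        return REASON_TO_LABEL[active[0]]
    return None


def preprocess_non_detoxified(df: pd.DataFrame) -> pd.DataFrame:
    out = df.rename(columns={"reasons": "target", "toxic_comment": "source"})
    out = out.assign(target=out["target"].map(reason_to_label))
    # Rows with an unmapped reason are treated as irrelevant (assumption).
    out = out.dropna(subset=["target"]).drop_duplicates()
    out = out.assign(target=out["target"].astype(int))
    return out.reset_index(drop=True)


clean = preprocess_non_detoxified(raw)
print(clean)
```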

Arsive/toxicity_classification_jigsaw
1. Lang: EN
2. Rows: 6490
3. Preprocess:
  1. Select validation split.
  2. Drop columns id, severe_toxic, obscene, threat, insult, identity_hate.
  3. Rename column toxic to target.
  4. Rename column comment_text to source.
  5. Reset indexes.

s-nlp/en_paradetox_toxicity
1. Lang: EN
2. Rows: 26507
3. Preprocess:
  1. Select train split.
  2. Rename column toxic to target.
  3. Rename column comment to source.
  4. Reset indexes.

Supervised Fine-Tuning (SFT) Parameters

Note

For the XSY/albert-base-v2-imdb-calssification model, set the SFT parameter target_modules=["query", "key", "value", "dense"].
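One way to keep such a model-specific setting apart from the shared defaults is a small override table, sketched below. The table name `MODEL_SFT_OVERRIDES`, the helper `build_sft_params`, the second model name, and the placeholder default value are hypothetical; only the `target_modules` list comes from the note above. In libraries such as PEFT, a parameter with this name is accepted by `LoraConfig`, but the source does not say which library consumes it.

```python
# Per-model overrides; only the target_modules value is from the documentation.
MODEL_SFT_OVERRIDES = {
    "XSY/albert-base-v2-imdb-calssification": {
        "target_modules": ["query", "key", "value", "dense"],
    },
}


def build_sft_params(model_name: str, base_params: dict) -> dict:
    """Merge shared defaults with any override registered for the model."""
    params = dict(base_params)  # copy so the shared defaults stay untouched
    params.update(MODEL_SFT_OVERRIDES.get(model_name, {}))
    return params


# Placeholder defaults, invented for illustration only.
defaults = {"learning_rate": 1e-3}

albert_params = build_sft_params("XSY/albert-base-v2-imdb-calssification", defaults)
other_params = build_sft_params("some/other-model", defaults)
print(albert_params)
```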

Metrics

F1-score
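For reference, the F1-score is the harmonic mean of precision and recall, F1 = 2 * P * R / (P + R). The sketch below computes it for one positive class in plain Python. The source does not state how scores are averaged for the multiclass datasets, so this example assumes the binary case; the function name `f1_binary` and the sample labels are illustrative.

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 for a single positive class; returns 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
score = f1_binary(y_true, y_pred)
print(round(score, 4))
```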