Classification

Models

Model

Lang

cointegrated/rubert-tiny-toxicity

EN

cointegrated/rubert-tiny2-cedr-emotion-detection

RU

papluca/xlm-roberta-base-language-detection

RU

fabriceyhc/bert-base-uncased-ag_news

EN

XSY/albert-base-v2-imdb-calssification

EN

IlyaGusev/rubertconv_toxic_clf

EN

aiknowyou/it-emotion-analyzer

RU

blanchefort/rubert-base-cased-sentiment-rusentiment

RU

tatiana-merz/turkic-cyrillic-classifier

RU

s-nlp/russian_toxicity_classifier

RU

Datasets

  1. OxAISH-AL-LLM/wiki_toxic

    1. Lang: EN

    2. Rows: 31915

    3. Preprocess:

      1. Select validation split.

      2. Drop column id.

      3. Rename column label to target.

      4. Rename column comment_text to source.

      5. Reset indexes.

Note

When obtaining this dataset, pass the following parameters to the call of load_dataset:

  • revision="refs/convert/parquet"

  1. seara/ru_go_emotions

    1. Lang: RU

    2. Rows: 5430

    3. Preprocess:

      1. Select simplified subset.

      2. Select validation split.

      3. Convert column labels to tuple.

      4. Drop columns id and text.

      5. Remove from labels values 0, 4, 5, 6, 7, 8, 10, 12, 15, 18, 21, 22, 23.

      6. Rename column labels to target.

      7. Rename column ru_text to source.

      8. Group emotions (leave only one label per row):

        1. Labels 1, 13, 17, 20 change to label 1.

        2. Labels 9, 16, 24, 25 change to label 2.

        3. Labels 14, 19 change to label 3.

        4. Labels 2, 3 change to label 4.

        5. Labels 27 change to label 7.

        6. Labels 26 change to label 6.

        7. Other labels to label 8.

      9. Drop label 8 from target.

      10. Map target labels to sequential numbers: 1 to 0 (joy), 2 to 1 (sadness), 3 to 2 (fear), 4 to 3 (anger), 6 to 4 (neutral), 7 to 5 (other).

      11. Clean column source.

      12. Reset indexes.

  2. papluca/language-identification

    1. Lang: EN

    2. Rows: 10000

    3. Preprocess:

      1. Select validation split.

      2. Rename column labels to target.

      3. Rename column text to source.

      4. Map language abbreviation to label classes.

      5. Reset indexes.

  3. ag_news

    1. Lang: EN

    2. Rows: 7600

    3. Preprocess:

      1. Select test split.

      2. Rename column label to target.

      3. Rename column text to source.

      4. Reset indexes.

  4. imdb

    1. Lang: EN

    2. Rows: 25000

    3. Preprocess:

      1. Select test split.

      2. Rename column label to target.

      3. Rename column text to source.

      4. Reset indexes.

Note

When used with XSY/albert-base-v2-imdb-calssification model, use the parameter setting max_length=512.

  1. dair-ai/emotion

    1. Lang: EN

    2. Rows: 2000

    3. Preprocess:

      1. Select split subset.

      2. Select validation split.

      3. Rename column label to target.

      4. Rename column text to source.

      5. Reset indexes.

  2. blinoff/kinopoisk

    1. Lang: RU

    2. Rows: 36591

    3. Preprocess:

      1. Select train split.

      2. Leave only content and grade3 columns.

      3. Rename column grade3 to target.

      4. Rename column content to source.

      5. Delete empty rows in dataset.

      6. Map target with class labels.

      7. Reset indexes.

  3. blinoff/healthcare_facilities_reviews

    1. Lang: RU

    2. Rows: 70597

    3. Preprocess:

      1. Select validation split.

      2. Leave only content and sentiment columns.

      3. Rename column sentiment to target.

      4. Rename column content to source.

      5. Map target with class labels.

Note

When obtaining this dataset, pass the following parameters to the call of load_dataset:

  • revision="refs/convert/parquet"

Note

In combination with a multiclass model blanchefort/rubert-base-cased-sentiment-rusentiment it is necessary to bring the neutral class to the negative class at the prediction stage.

  1. tatiana-merz/cyrillic_turkic_langs

    1. Lang: RU

    2. Rows: 9000

    3. Preprocess:

      1. Select validation split.

      2. Rename column label to target.

      3. Rename column text to source.

      4. Map target with class labels.

  2. s-nlp/ru_paradetox_toxicity

  1. Lang: RU

  2. Rows: 6350

  3. Preprocess:

    1. Select train split.

    2. Rename column toxic to target.

    3. Rename column neutral to source.

    4. Delete duplicates in dataset.

    5. Map target with class labels.

    6. Reset indexes.

  1. d0rj/rudetoxifier_data

  1. Lang: RU

  2. Rows: 163187

  3. Preprocess:

    1. Select train split.

    2. Rename column toxic to target.

    3. Rename column text to source.

  1. s-nlp/ru_non_detoxified

  1. Lang: RU

  2. Rows: 20900

  3. Preprocess:

    1. Select train split.

    2. Rename column reasons to target.

    3. Rename column toxic_comment to source.

    4. Rename {"toxic_content":true} label to 1 and {"not_toxic":true} label to 0.

    5. Remove irrelevant rows in dataset.

    6. Delete duplicates in dataset.

    7. Reset indexes.

  1. Arsive/toxicity_classification_jigsaw

  1. Lang: EN

  2. Rows: 6490

  3. Preprocess:

    1. Select validation split.

    2. Drop column id, severe_toxic, obscene, threat, insult, identity_hate.

    3. Rename column toxic to target.

    4. Rename column comment_text to source.

    5. Reset indexes.

  1. s-nlp/en_paradetox_toxicity

  1. Lang: EN

  2. Rows: 26507

  3. Preprocess:

    1. Select train split.

    2. Rename column toxic to target.

    3. Rename column comment to source.

    4. Reset indexes.

Supervised Fine-Tuning (SFT) Parameters

Note

Metrics

  • F1-score

Important

Use average = "micro".