Classification

Models

Model

Lang

cointegrated/rubert-tiny-toxicity

EN

cointegrated/rubert-tiny2-cedr-emotion-detection

RU

papluca/xlm-roberta-base-language-detection

RU

fabriceyhc/bert-base-uncased-ag_news

EN

XSY/albert-base-v2-imdb-calssification

EN

IlyaGusev/rubertconv_toxic_clf

EN

aiknowyou/it-emotion-analyzer

RU

blanchefort/rubert-base-cased-sentiment-rusentiment

RU

tatiana-merz/turkic-cyrillic-classifier

RU

s-nlp/russian_toxicity_classifier

RU

Datasets

  1. OxAISH-AL-LLM/wiki_toxic

    1. Lang: EN

    2. Rows: 31915

    3. Preprocess:

      1. Drop column id.

      2. Rename column label to target.

      3. Rename column comment_text to source.

      4. Reset indexes.

  2. seara/ru_go_emotions

    1. Lang: RU

    2. Rows: 5430

    3. Preprocess:

      1. Select simplified subset.

      2. Drop columns id and text.

      3. Convert column labels to tuple.

      4. Remove from labels values 0, 4, 5, 6, 7, 8, 10, 12, 15, 18, 21, 22, 23.

      5. Rename column labels to target.

      6. Rename column ru_text to source.

      7. Group emotions and change numbers to words:

        1. Labels 1, 13, 17, 20 change to label 1.

        2. Labels 9, 16, 24, 25 change to label 2.

        3. Labels 14, 19 change to label 3.

        4. Labels 2, 3 change to label 4.

        5. Labels 27 change to label 7.

        6. Labels 26 change to label 6.

        7. Other labels to label 8.

      8. Delete duplicates in target.

      9. Clean column source.

      10. Reset indexes.

  3. papluca/language-identification

    1. Lang: EN

    2. Rows: 10000

    3. Preprocess:

      1. Rename column labels to target.

      2. Rename column text to source.

      3. Map language abbreviation to label classes.

      4. Reset indexes.

  4. ag_news

    1. Lang: EN

    2. Rows: 7600

    3. Preprocess:

      1. Rename column label to target.

      2. Rename column text to source.

      3. Reset indexes.

  5. imdb

    1. Lang: EN

    2. Rows: 25000

    3. Preprocess:

      1. Select test split.

      2. Rename column labels to target.

      3. Rename column text to source.

      4. Reset indexes.

  6. dair-ai/emotion

    1. Lang: EN

    2. Rows: 2000

    3. Preprocess:

      1. Select split subset.

      2. Select validation split.

      3. Rename column label to target.

      4. Rename column text to source.

      5. Reset indexes.

  7. blinoff/kinopoisk

    1. Lang: RU

    2. Rows: 36591

    3. Preprocess:

      1. Select validation split.

      2. Leave only content and grade3 columns.

      3. Rename column grade3 to target.

      4. Rename column content to source.

      5. Delete empty rows in dataset.

      6. Map target with class labels.

      7. Reset indexes.

  8. blinoff/healthcare_facilities_reviews

    1. Lang: RU

    2. Rows: 70597

    3. Preprocess:

      1. Select validation split.

      2. Leave only content and sentiment columns.

      3. Rename column sentiment to target.

      4. Rename column content to source.

      5. Map target with class labels.

Note

In combination with a multiclass model blanchefort/rubert-base-cased-sentiment-rusentiment it is necessary to bring the neutral class to the negative class at the prediction stage.

  1. tatiana-merz/cyrillic_turkic_langs

    1. Lang: RU

    2. Rows: 9000

    3. Preprocess:

      1. Select validation split.

      2. Rename column label to target.

      3. Rename column text to source.

      4. Map target with class labels.

  2. s-nlp/ru_paradetox_toxicity

  1. Lang: RU

  2. Rows: 6350

  3. Preprocess:

    1. Rename column toxic to target.

    2. Rename column neutral to source.

    3. Delete duplicates in dataset.

    4. Map target with class labels.

    5. Reset indexes.

  1. d0rj/rudetoxifier_data

  1. Lang: RU

  2. Rows: 163187

  3. Preprocess:

    1. Select train split.

    2. Rename column toxic to target.

    3. Rename column text to source.

  1. s-nlp/ru_non_detoxified

  1. Lang: RU

  2. Rows: 20900

  3. Preprocess:

    1. Rename column reasons to target.

    2. Rename column toxic_comment to source.

    3. Rename {"toxic_content":true} label to 1 and {"not_toxic":true} label to 0.

    4. Remove irrelevant rows in dataset.

    5. Delete duplicates in dataset.

    6. Reset indexes.

  1. Arsive/toxicity_classification_jigsaw

  1. Lang: EN

  2. Rows: 6490

  3. Preprocess:

    1. Select validation split.

    2. Drop column id, severe_toxic, obscene, threat, insult, identity_hate.

    3. Rename column toxic to target.

    4. Rename column comment_text to source.

    5. Reset indexes.

  1. s-nlp/en_paradetox_toxicity

  1. Lang: EN

  2. Rows: 26507

  3. Preprocess:

    1. Select train split.

    2. Rename column toxic to target.

    3. Rename column comment to source.

    4. Reset indexes.

Supervised Fine-Tuning (SFT) Parameters

Note

Set the parameter target_modules=["query", "key", "value", "dense"] for the XSY/albert-base-v2-imdb-calssification model as SFT parameter.

Metrics

  • F1-score