NER

Models

Model	Lang	Task
dslim/distilbert-NER	EN	NER
Babelscape/wikineural-multilingual-ner	EN	NER

Datasets

Babelscape/wikineural
1. Lang: EN
2. Rows: 11590
3. Preprocess:
  1. Select val_en split.
  2. Rename column ner_tags to target.
  3. Rename column tokens to source.
  4. Reset indexes.
eriktks/conll2003
1. Lang: EN
2. Rows: 3250
3. Preprocess:
  1. Select validation split.
  2. Rename column ner_tags to target.
  3. Rename column tokens to source.
  4. Reset indexes.

Inferring batch

Process of implementing method lab_7_llm.main.LLMPipeline._infer_batch() for named entity recognition task has its specifics:

You need to set the is_split_into_words=True parameter during the tokenization.

The prediction of the model will contain a tensor with labels for each token obtained during tokenization of sample_batch.

The number of labels corresponds to the number of tokens.

To assess the quality of the model, it is necessary that the number of labels coincides with the length of the original sequence.

You need to process model prediction result so that the prediction contains only the labels of the first tokens of each word. Use the word_ids method of the tokenizer to determine the word boundaries.

Note

For example, there is a sample ['CRICKET', '-', 'LEICESTERSHIRE', 'TAKE', 'OVER', 'AT', 'TOP', '.'] which is tokenized to ['[CLS]', 'CR', '##IC', '##KE', '##T', '-', 'L', '##EI', '##CE', '##ST', '##ER', '##S', '##H', '##IR', '##E', 'T', '##A', '##KE', 'O', '##VE', '##R', 'AT', 'TO', '##P', '[SEP]']. In this case, each token corresponds to the following predictions [[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]. Only the labels for the first token of each word need to be included in the final result, namely [[0, 0, 0, 1, 0, 0, 0, 0, 0]]. Thus, if the model predicted label 1 for the first token of the word LEICESTERSHIRE, then the final result for this word will include 1.

Metrics

Accuracy