NER
Models
| Model | Lang | Task |
|---|---|---|
| | EN | NER |
| | EN | NER |
Datasets
- Lang: EN
  Rows: 11590
  Preprocess: Select the `val_en` split. Rename column `ner_tags` to `target`. Rename column `tokens` to `source`. Reset indexes.
- Lang: EN
  Rows: 3250
  Preprocess: Select the `validation` split. Rename column `ner_tags` to `target`. Rename column `tokens` to `source`. Reset indexes.
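The renaming and index reset steps listed above can be sketched with pandas. This is only an illustration: it assumes the selected split has already been loaded into a `DataFrame`, and the helper name `preprocess_ner_split` and the toy rows are hypothetical, not part of the lab code.

```python
import pandas as pd


def preprocess_ner_split(raw: pd.DataFrame) -> pd.DataFrame:
    """Rename NER columns and reset the index (hypothetical helper)."""
    # ner_tags -> target, tokens -> source, as the preprocessing steps require
    renamed = raw.rename(columns={"ner_tags": "target", "tokens": "source"})
    # drop=True discards the old index instead of keeping it as a column
    return renamed.reset_index(drop=True)


# Toy stand-in for an already selected split; the index is deliberately non-contiguous
raw_split = pd.DataFrame(
    {
        "tokens": [["AT", "TOP"], ["TAKE", "OVER"]],
        "ner_tags": [[0, 0], [0, 0]],
    },
    index=[5, 9],
)

processed = preprocess_ner_split(raw_split)
print(list(processed.columns))
print(list(processed.index))
```

Selecting the split itself is not shown here, because it depends on how the dataset is loaded.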
Inferring batch
Implementing the `lab_7_llm.main.LLMPipeline._infer_batch()` method for the named entity recognition task has its specifics:

- Set the `is_split_into_words=True` parameter during tokenization.
- The model prediction contains a tensor with a label for each token obtained during tokenization of `sample_batch`, so the number of labels equals the number of tokens.
- To assess the quality of the model, the number of labels must match the length of the original sequence, that is, the number of words.
- Process the model prediction so that it contains only the label of the first token of each word. Use the `word_ids` method of the tokenization result to determine the word boundaries.
Note
For example, the sample `['CRICKET', '-', 'LEICESTERSHIRE', 'TAKE', 'OVER', 'AT', 'TOP', '.']` is tokenized to `['[CLS]', 'CR', '##IC', '##KE', '##T', '-', 'L', '##EI', '##CE', '##ST', '##ER', '##S', '##H', '##IR', '##E', 'T', '##A', '##KE', 'O', '##VE', '##R', 'AT', 'TO', '##P', '[SEP]']`. In this case, the tokens correspond to the following predictions: `[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]`.

Only the labels for the first token of each word need to be included in the final result, namely `[[0, 0, 0, 1, 0, 0, 0, 0, 0]]`. Thus, if the model predicted label `1` for the first token of the word `LEICESTERSHIRE`, the final result for this word will include `1`.
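The first-token selection described above can be sketched in plain Python. This is a minimal illustration, not the lab's implementation: the function name `first_token_labels` and the toy inputs are hypothetical. It assumes the word indices have the shape returned by `word_ids()` on the output of a fast Hugging Face tokenizer, where special tokens such as `[CLS]` and `[SEP]` map to `None`.

```python
from typing import List, Optional, Sequence


def first_token_labels(
    word_ids: Sequence[Optional[int]], token_labels: Sequence[int]
) -> List[int]:
    """Keep only the label predicted for the first token of each word."""
    if len(word_ids) != len(token_labels):
        raise ValueError("word_ids and token_labels must have the same length")

    kept: List[int] = []
    previous_word_id: Optional[int] = None
    for word_id, label in zip(word_ids, token_labels):
        if word_id is None:
            # Special tokens do not belong to any word, so they are skipped
            continue
        if word_id != previous_word_id:
            # A new word starts here: keep the label of its first token
            kept.append(label)
        previous_word_id = word_id
    return kept


# Hypothetical example: ['TAKE', 'OVER', '.'] split into
# [CLS] T ##A ##KE O ##VE ##R . [SEP]
toy_word_ids = [None, 0, 0, 0, 1, 1, 1, 2, None]
toy_token_labels = [0, 1, 0, 0, 0, 1, 1, 0, 0]

final_labels = first_token_labels(toy_word_ids, toy_token_labels)
print(final_labels)
```

The result has one label per word, so its length matches the original sequence. Labels predicted for continuation tokens are dropped, even when they differ from the label of the first token.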
Metrics
Accuracy
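As an illustration of the metric, accuracy for NER can be computed as the share of word-level labels that match the references. The sketch below assumes that the labels of all samples are flattened before comparison. The source does not state this, and the helper name `accuracy` and the toy labels are hypothetical.

```python
from typing import List, Sequence


def accuracy(
    references: Sequence[Sequence[int]], predictions: Sequence[Sequence[int]]
) -> float:
    """Share of matching labels after flattening all samples (assumed scheme)."""
    flat_references: List[int] = [label for sample in references for label in sample]
    flat_predictions: List[int] = [label for sample in predictions for label in sample]
    if len(flat_references) != len(flat_predictions):
        raise ValueError("references and predictions must contain the same number of labels")
    if not flat_references:
        raise ValueError("cannot compute accuracy on empty input")

    correct = sum(ref == pred for ref, pred in zip(flat_references, flat_predictions))
    return correct / len(flat_references)


# Toy labels: one of six word-level predictions is wrong
toy_references = [[0, 0, 1, 0], [0, 1]]
toy_predictions = [[0, 0, 1, 1], [0, 1]]

score = accuracy(toy_references, toy_predictions)
print(score)
```

The length check is the reason the predictions must first be reduced to one label per word: token-level predictions would not line up with the reference labels.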