# Generation

## Models
| Model | Lang  | Task      |
|-------|-------|-----------|
| -     | EN/RU | CLOSED QA |
| -     | EN    | CLOSED QA |
| -     | EN    | OPEN QA   |
| -     | EN    | OPEN QA   |
| -     | EN    | OPEN QA   |
## Datasets CLOSED QA
### starmpcc/Asclepius-Synthetic-Clinical-Notes

- Lang: EN
- Rows: 20038
- Preprocess:
  1. Choose task `Question Answering`.
  2. Choose columns `note`, `question` and `answer`.
  3. Rename column `note` to `context`.
  4. Rename column `answer` to `target`.
  5. Reset indexes.
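These steps map onto a handful of pandas calls. A minimal sketch, assuming the split is already loaded into a `pandas.DataFrame` named `df` and that the task label lives in a `task` column (both names are assumptions, not taken from the dataset card):

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the preprocessing steps listed above to a loaded DataFrame."""
    # Choose task Question Answering (assuming the label lives in a 'task' column).
    df = df[df['task'] == 'Question Answering']
    # Choose columns note, question and answer.
    df = df[['note', 'question', 'answer']]
    # Rename note -> context and answer -> target.
    df = df.rename(columns={'note': 'context', 'answer': 'target'})
    # Reset indexes.
    return df.reset_index(drop=True)
```

The other entries in this section follow the same select / rename / reset pattern with their own column names.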
### -

- Lang: EN
- Rows: 1773
- Preprocess:
  1. Choose columns `instruction`, `context` and `response`.
  2. Rename column `instruction` to `question`.
  3. Rename column `response` to `target`.
  4. Reset indexes.
### -

- Lang: EN
- Rows: 260
- Preprocess:
  1. Select `train_sft` split.
  2. Choose category `Closed QA`.
  3. Choose columns `prompt`, `messages`.
  4. Rename column `prompt` to `question`.
  5. Reset indexes.
  6. Process column `messages` with raw text into two columns `context` and `answer` (see the sketch below).
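For the last step, one possible reading is that `messages` holds a chat-style list of `{'role': ..., 'content': ...}` dicts; the exact structure is dataset-specific, so treat this purely as an illustration:

```python
import pandas as pd


def split_messages(messages: list) -> pd.Series:
    """Split a chat transcript into a context and an answer.

    Assumes the first user turn carries the context and the first
    assistant turn carries the answer; adjust to the real structure.
    """
    context = next(m['content'] for m in messages if m['role'] == 'user')
    answer = next(m['content'] for m in messages if m['role'] == 'assistant')
    return pd.Series({'context': context, 'answer': answer})


# df[['context', 'answer']] = df['messages'].apply(split_messages)
```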
### -

- Lang: RU
- Rows: 5040
- Preprocess:
  1. Select `validation` split.
  2. Choose columns `question`, `context`, `answers`.
  3. Rename column `answers` to `target`.
  4. Process column `target` with raw text to leave just the answer in this column.
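If the `answers` column follows the usual SQuAD-style layout (`{'text': [...], 'answer_start': [...]}`), which is an assumption here, the last step can be sketched as:

```python
# Keep only the answer string, assuming a SQuAD-style answers structure.
df['target'] = df['target'].apply(
    lambda answers: answers['text'][0] if answers['text'] else ''
)
```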
### -

- Lang: RU
- Rows: 173000
- Preprocess:
  1. Select `train` split and `wikiomnia_ruGPT3_filtered` subset.
  2. Drop NaN.
  3. Drop duplicates.
  4. Reset indexes.
  5. Choose columns `question`, `summary`, `answer`.
  6. Rename columns `summary` to `context` and `answer` to `target`.
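A sketch of the loading and cleaning steps, assuming the Hugging Face `datasets` library and pandas; the dataset identifier is not reproduced here, so it is left as a placeholder:

```python
from datasets import load_dataset

# Substitute the real dataset identifier for the placeholder.
ds = load_dataset('<dataset-name>', name='wikiomnia_ruGPT3_filtered', split='train')
df = ds.to_pandas()

# Drop NaN, drop duplicates, reset indexes.
df = df.dropna().drop_duplicates().reset_index(drop=True)
# Choose and rename the QA columns.
df = df[['question', 'summary', 'answer']]
df = df.rename(columns={'summary': 'context', 'answer': 'target'})
```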
## Datasets OPEN QA
### -

- Lang: EN
- Rows: 817
- Preprocess:
  1. Drop columns `type`, `category`, `correct_answers`, `incorrect_answers`, `source`.
  2. Rename column `best_answer` to `target`.
### jtatman/databricks-dolly-8k-qa-open-close

- Lang: EN
- Rows: 7706
- Preprocess:
  1. Filter dataset rows by `category` == `open_qa`.
  2. Drop columns `context`, `category`, `__index_level_0__`.
  3. Rename column `instruction` to `question`.
  4. Rename column `response` to `target`.
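This entry adds a filtering step on top of the usual drop/rename pattern; a pandas sketch, again assuming a loaded `df`:

```python
# Keep only open-QA rows, then drop helper columns and rename to the target schema.
df = df[df['category'] == 'open_qa']
df = df.drop(columns=['context', 'category', '__index_level_0__'])
df = df.rename(columns={'instruction': 'question', 'response': 'target'})
```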
### -

- Lang: EN
- Rows: 52002
- Preprocess:
  1. Drop columns `input`, `text`.
  2. Rename column `instruction` to `question`.
  3. Rename column `output` to `target`.
### -

- Lang: EN
- Rows: 188
- Preprocess:
  1. Drop columns `context`, `category`, `text`.
  2. Rename column `instruction` to `question`.
  3. Rename column `response` to `target`.
## Inferring batch

Implementing the method `stubs.labs.lab_7_llm.main.LLMPipeline._infer_batch()`
for the closed question-answering task has its specifics:

1. You need to transpose the `sample_batch` before you pass it to the tokenizer,
   so that it is a sequence of tuples where each tuple has two strings:
   a question and a context.
2. The prediction of the model consists of two tensors that contain
   the start and end scores respectively.
3. Only the ids between the start and end locations, which correspond to the answer,
   have to be decoded and passed on. To get the ids, iterate through the
   `input_ids` field of the tokenized batch.
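A minimal sketch of that flow, assuming `sample_batch` arrives as two parallel sequences (all questions, then all contexts) and that the pipeline stores a Hugging Face extractive-QA model and its tokenizer as `self._model` and `self._tokenizer` (these attribute names are an assumption, not part of the stub):

```python
import torch


def _infer_batch(self, sample_batch):
    """Sketch of batch inference for the closed QA task."""
    # Transpose: (questions, contexts) -> sequence of (question, context) tuples.
    pairs = [tuple(sample) for sample in zip(*sample_batch)]

    inputs = self._tokenizer(
        pairs, padding=True, truncation=True, return_tensors='pt'
    )

    with torch.no_grad():
        outputs = self._model(**inputs)

    # Two tensors with per-token scores for the answer start and end positions.
    starts = outputs.start_logits.argmax(dim=-1)
    ends = outputs.end_logits.argmax(dim=-1)

    predictions = []
    # Iterate through the input_ids field of the tokenized batch and decode
    # only the ids between the start and end locations (the answer span).
    for input_ids, start, end in zip(inputs['input_ids'], starts, ends):
        answer_ids = input_ids[int(start): int(end) + 1]
        predictions.append(
            self._tokenizer.decode(answer_ids, skip_special_tokens=True)
        )
    return predictions
```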
## Metrics

- Open QA: BLEU, ROUGE
- Closed QA: squad

**Note**: To calculate the squad metric, you need to convert the data
into a special structure. You can find this structure in this repository,
in the `metrics` directory.
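For reference, the `squad` metric in the Hugging Face `evaluate` library expects predictions and references in roughly the following shape; this is an illustration only, and the authoritative structure is the one in the repository's `metrics` directory:

```python
import evaluate

squad_metric = evaluate.load('squad')

# Each prediction needs an id and the predicted answer text.
predictions = [
    {'id': '0', 'prediction_text': 'Paris'},
]
# Each reference needs the same id and a SQuAD-style answers dict.
references = [
    {'id': '0', 'answers': {'text': ['Paris'], 'answer_start': [0]}},
]

results = squad_metric.compute(predictions=predictions, references=references)
print(results)  # e.g. {'exact_match': 100.0, 'f1': 100.0}
```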