Generation
Models
| Model | Lang | Task |
|---|---|---|
| | EN/RU | CLOSED QA |
| | EN | CLOSED QA |
| | EN | OPEN QA |
| | EN | OPEN QA |
| | EN | OPEN QA |
Datasets CLOSED QA
starmpcc/Asclepius-Synthetic-Clinical-Notes
Lang: EN
Rows: 20038
Preprocess:
1. Choose task Question Answering.
2. Choose columns `note`, `question` and `answer`.
3. Rename column `note` to `context`.
4. Rename column `answer` to `target`.
5. Reset indexes.
-
Lang: EN
Rows: 1773
Preprocess:
1. Choose columns `instruction`, `context` and `response`.
2. Rename column `instruction` to `question`.
3. Rename column `response` to `target`.
4. Reset indexes.
-
Lang: EN
Rows: 260
Preprocess:
1. Select `train_sft` split.
2. Choose category Closed QA.
3. Choose columns `prompt`, `messages`.
4. Rename column `prompt` to `question`.
5. Reset indexes.
6. Process column `messages` with raw text into two columns `context` and `answer`.
-
Lang: RU
Rows: 5040
Preprocess:
1. Select `validation` split.
2. Choose columns `question`, `context`, `answers`.
3. Rename column `answers` to `target`.
4. Process column `target` with raw text to leave just the answer in this column.
-
Lang: RU
Rows: 173000
Preprocess:
1. Select `train` split and `wikiomnia_ruGPT3_filtered` subset.
2. Drop NaN.
3. Drop duplicates.
4. Reset indexes.
5. Choose columns `question`, `summary`, `answer`.
6. Rename columns `summary` to `context` and `answer` to `target`.
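The recurring choose-columns / rename / reset-indexes pattern above can be sketched with pandas. This is an illustrative sketch only, using the column schema of the first dataset; the data itself is made up and the course code may organize these steps differently.

```python
import pandas as pd

# Made-up stand-in for a loaded dataset with one extra column.
raw = pd.DataFrame({
    "note": ["ctx1", "ctx2"],
    "question": ["q1", "q2"],
    "answer": ["a1", "a2"],
    "extra": [0, 1],
})

# Choose columns, rename them to the unified schema, reset indexes.
df = (raw[["note", "question", "answer"]]
      .rename(columns={"note": "context", "answer": "target"})
      .reset_index(drop=True))
# df now has columns: context, question, target
```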
Datasets OPEN QA
-
Lang: EN
Rows: 817
Preprocess:
1. Drop columns `type`, `category`, `correct_answers`, `incorrect_answers`, `source`.
2. Rename column `best_answer` to `target`.
jtatman/databricks-dolly-8k-qa-open-close
Lang: EN
Rows: 7706
Preprocess:
1. Filter dataset rows by `category == open_qa`.
2. Drop columns `context`, `category`, `__index_level_0__`.
3. Rename column `instruction` to `question`.
4. Rename column `response` to `target`.
-
Lang: EN
Rows: 52002
Preprocess:
1. Drop columns `input`, `text`.
2. Rename column `instruction` to `question`.
3. Rename column `output` to `target`.
-
Lang: EN
Rows: 188
Preprocess:
1. Drop columns `context`, `category`, `text`.
2. Rename column `instruction` to `question`.
3. Rename column `response` to `target`.
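The row-filtering step used for the Dolly subset can be sketched the same way; the rows below are illustrative, not the real dataset contents.

```python
import pandas as pd

# Made-up rows mimicking the Dolly schema.
raw = pd.DataFrame({
    "instruction": ["q1", "q2", "q3"],
    "response": ["a1", "a2", "a3"],
    "category": ["open_qa", "closed_qa", "open_qa"],
    "context": ["", "c", ""],
})

# Keep only open_qa rows, drop unused columns, rename to the unified schema.
df = (raw[raw["category"] == "open_qa"]
      .drop(columns=["context", "category"])
      .rename(columns={"instruction": "question", "response": "target"})
      .reset_index(drop=True))
# df keeps 2 rows with columns: question, target
```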
Inferring batch
Implementing the method `stubs.labs.lab_7_llm.main.LLMPipeline._infer_batch()` for the closed question-answering task has its specifics:
1. You need to transpose the `sample_batch` before passing it to the tokenizer, so that it becomes a sequence of tuples where each tuple holds two strings: a question and a context.
2. The model's prediction consists of two tensors that contain start and end scores respectively.
3. Only the ids between the start and end positions, which correspond to the answer, have to be decoded and passed on. To get the ids, iterate through the `input_ids` field of the tokenized batch.
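The transpose and span-extraction steps above can be sketched in plain Python. The helper names and the score lists are illustrative assumptions; a real implementation would work with the tokenizer's encoding and the model's score tensors instead of lists.

```python
def transpose_batch(sample_batch):
    """Turn a column layout ([q1, q2], [c1, c2]) into row pairs
    [(q1, c1), (q2, c2)] — the (question, context) tuples the
    tokenizer expects for extractive QA."""
    return list(zip(*sample_batch))


def extract_answer_ids(input_ids, start_scores, end_scores):
    """Keep only the token ids between the highest-scoring start
    and end positions (inclusive); these are decoded as the answer."""
    start = max(range(len(start_scores)), key=start_scores.__getitem__)
    end = max(range(len(end_scores)), key=end_scores.__getitem__)
    return input_ids[start:end + 1]


# Made-up batch: two questions and two contexts in column layout.
batch = (["Who wrote it?", "Where is it?"],
         ["Shakespeare wrote it.", "It is in Paris."])
pairs = transpose_batch(batch)
```

In a real pipeline, `extract_answer_ids` would run once per sequence in the batch, and the sliced ids would be passed to the tokenizer's decode method.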
Metrics
Open QA:
- BLEU
- ROUGE
Closed QA:
- squad
Note
To calculate the `squad` metric, you need to convert the data into a special structure. You can find this structure in this repository, in the metrics directory.
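For orientation, below is a minimal sketch of the structure that the `squad` metric from the HF `evaluate` library consumes: a list of prediction dicts and a list of reference dicts matched by `id`. The ids and texts here are made up; check the metrics directory for the exact structure the course expects.

```python
# One prediction per example, keyed by a shared id.
predictions = [
    {"id": "0", "prediction_text": "Paris"},
]

# References carry the gold answers and their character offsets.
references = [
    {"id": "0",
     "answers": {"text": ["Paris"], "answer_start": [10]}},
]

# With the evaluate library installed, the metric would be computed as:
#   import evaluate
#   squad = evaluate.load("squad")
#   result = squad.compute(predictions=predictions, references=references)
# result would contain "exact_match" and "f1" scores.
```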