Generation
Models
| Model | Lang | Task |
|---|---|---|
| | EN | CLOSED QA |
| | EN | CLOSED QA |
| | EN | OPEN QA |
| | EN | OPEN QA |
| | EN | OPEN QA |
Datasets CLOSED QA
starmpcc/Asclepius-Synthetic-Clinical-Notes
Lang: EN
Rows: 20038
Preprocess:
1. Select the `train` split.
2. Choose the task `Question Answering`.
3. Choose the columns `note`, `question` and `answer`.
4. Rename column `note` to `context`.
5. Rename column `answer` to `target`.
6. Reset indexes.
-
Lang: EN
Rows: 89
Preprocess:
1. Select the `test` split.
2. Choose the columns `instruction`, `context` and `response`.
3. Rename column `instruction` to `question`.
4. Rename column `response` to `target`.
5. Reset indexes.
-
Lang: EN
Rows: 245
Preprocess:
1. Select the `train` split.
2. Choose the category `Closed QA`.
3. Choose the columns `prompt`, `messages`.
4. Convert column `messages` to string, using an f-string.
5. Rename column `prompt` to `question`.
6. Reset indexes.
7. Process column `messages` with raw text into two columns, `context` and `answer`.
-
Lang: RU
Rows: 5040
Preprocess:
1. Select the `validation` split.
2. Choose the columns `question`, `context`, `answers`.
3. Rename column `answers` to `target`.
4. Process column `target` with raw text to leave just an answer in this column.
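The last preprocessing step depends on how the raw values are stored. A minimal sketch, assuming each value follows the SQuAD-style layout (a mapping with a `text` list and an `answer_start` list), possibly serialized as a string. `extract_answer` is a hypothetical helper name, and the actual column should be inspected before relying on this assumption:

```python
import ast


def extract_answer(raw):
    """Return only the answer text from a SQuAD-style value.

    Assumption: `raw` is either a dict like
    {"text": ["..."], "answer_start": [0]} or its string representation.
    """
    if isinstance(raw, str):
        # Parse the textual form of the dict without executing arbitrary code.
        raw = ast.literal_eval(raw)
    texts = raw["text"]
    # Keep the first listed answer; an empty list yields an empty string.
    return texts[0] if len(texts) > 0 else ""


example = "{'text': ['in 1961'], 'answer_start': [42]}"
print(extract_answer(example))
```

With a pandas data frame, such a helper could be applied to the `target` column through `Series.apply`.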
-
Lang: RU
Rows: 173000
Preprocess:
1. Select the `train` split and the `wikiomnia_ruGPT3_filtered` subset.
2. Drop NaN.
3. Drop duplicates.
4. Reset indexes.
5. Choose the columns `question`, `summary`, `answer`.
6. Rename columns `summary` to `context` and `answer` to `target`.
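The cleaning steps for this dataset map closely onto pandas operations. A minimal sketch, assuming the split has already been converted to a `pandas.DataFrame`; the function name `preprocess_wikiomnia` and the toy rows are made up for illustration:

```python
import pandas as pd


def preprocess_wikiomnia(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the listed steps, in the listed order, to a loaded split."""
    return (
        df.dropna()                      # drop rows with missing values
        .drop_duplicates()               # drop fully identical rows
        .reset_index(drop=True)          # renumber rows from 0
        .loc[:, ["question", "summary", "answer"]]
        .rename(columns={"summary": "context", "answer": "target"})
    )


# Toy data: row 1 duplicates row 0, row 2 has a missing summary.
toy = pd.DataFrame(
    {
        "title": ["a", "a", "b", "c"],
        "summary": ["ctx 1", "ctx 1", None, "ctx 3"],
        "question": ["q1", "q1", "q2", "q3"],
        "answer": ["a1", "a1", "a2", "a3"],
    }
)
clean = preprocess_wikiomnia(toy)
print(clean)
```

Resetting the index after dropping rows keeps row labels contiguous, which matters if later code addresses samples by position.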
Note
When obtaining this dataset, pass the following parameters to the `load_dataset` call:
- `revision="refs/convert/parquet"`
- `data_files={"train": "wikiomnia_ruGPT3_filtered/train/*.parquet"}`
Inferring batch
Implementing the method `lab_7_llm.main.LLMPipeline._infer_batch()` for the closed question-answering task has its own specifics:
- Transpose `sample_batch` before passing it to the tokenizer, so that it becomes a sequence of tuples where each tuple holds two strings: a question and a context.
- The model prediction consists of two tensors that contain the start and end scores respectively.
- Only the ids between the start and end locations that correspond to the answer have to be decoded and passed on.
- To get the ids, iterate through the `input_ids` field of the tokenized batch.
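The points above can be sketched in plain Python. This is an illustration under stated assumptions, not the reference implementation: the helper names are hypothetical, plain lists stand in for the score tensors and for one row of `input_ids`, the ids are made up, and the end position is treated as inclusive.

```python
def transpose_batch(sample_batch):
    """Turn (questions, contexts) into [(question, context), ...]."""
    return list(zip(*sample_batch))


def best_span(start_scores, end_scores):
    """Pick the highest-scoring start and end positions for one sample."""
    start = max(range(len(start_scores)), key=start_scores.__getitem__)
    end = max(range(len(end_scores)), key=end_scores.__getitem__)
    return start, end


def answer_ids(input_ids, start_scores, end_scores):
    """Slice out the ids of the predicted answer (end treated as inclusive).

    If the predicted end precedes the start, the slice is empty.
    """
    start, end = best_span(start_scores, end_scores)
    return input_ids[start : end + 1]


sample_batch = [("Who wrote it?", "When?"), ("Ann wrote it.", "It was in 1961.")]
pairs = transpose_batch(sample_batch)

# Made-up stand-ins for one tokenized sample and its two score vectors.
input_ids = [101, 2040, 102, 5754, 2626, 102]
start_scores = [0.1, 0.0, 0.2, 3.5, 0.3, 0.1]
end_scores = [0.0, 0.1, 0.1, 0.4, 2.9, 0.2]
span = answer_ids(input_ids, start_scores, end_scores)
print(pairs)
print(span)
```

With real model output, the same selection is typically done with `torch.argmax` over the start and end scores, and the selected ids are then passed to the tokenizer's `decode` method.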
Metrics CLOSED QA
squad
Note
To calculate the squad metric, convert the data
into the special structure that the metric expects. You can find this structure in
this repository
in the metrics directory.
Important
Of the two scores available in squad, use the f1 score.
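As an illustration of the conversion, here is a minimal sketch that builds the layout documented for the Hugging Face `squad` metric: predictions as `id` plus `prediction_text`, references as `id` plus an `answers` mapping. The function name `to_squad_format`, the use of the row position as `id`, and the placeholder `answer_start` value are assumptions; compare the result with the structure in the metrics directory before using it.

```python
def to_squad_format(predictions, targets):
    """Wrap plain strings into the structures the squad metric expects."""
    squad_predictions = []
    squad_references = []
    for idx, (prediction, target) in enumerate(zip(predictions, targets)):
        sample_id = str(idx)  # assumption: the row position serves as the id
        squad_predictions.append(
            {"id": sample_id, "prediction_text": prediction}
        )
        squad_references.append(
            {
                "id": sample_id,
                # answer_start is a placeholder: positions are unknown here.
                "answers": {"text": [target], "answer_start": [0]},
            }
        )
    return squad_predictions, squad_references


preds, refs = to_squad_format(["Ann", "in 1961"], ["Ann", "1961"])
print(preds[0])
print(refs[1])
```

With the `evaluate` package, the two lists could then be passed to `evaluate.load("squad").compute(predictions=preds, references=refs)`, and the `f1` entry taken from the returned dictionary.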
Datasets OPEN QA
-
Lang: EN
Rows: 817
Preprocess:
1. Select the `train` split.
2. Drop columns `Type`, `Category`, `Correct Answers`, `Incorrect Answers`, `Source`.
3. Rename column `Best Answer` to `target`.
jtatman/databricks-dolly-8k-qa-open-close
Lang: EN
Rows: 7706
Preprocess:
1. Select the `train` split.
2. Filter dataset rows by `category` == `open_qa`.
3. Drop columns `context`, `category`, `__index_level_0__`.
4. Rename column `instruction` to `question`.
5. Rename column `response` to `target`.
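The filter, drop and rename steps for this dataset could look as follows in pandas. This is a sketch under the assumption that the split is already available as a `pandas.DataFrame`; the function name `preprocess_open_qa` and the toy rows are invented for illustration:

```python
import pandas as pd


def preprocess_open_qa(df: pd.DataFrame) -> pd.DataFrame:
    """Keep open_qa rows, drop unused columns, rename the rest."""
    return (
        df[df["category"] == "open_qa"]
        .drop(columns=["context", "category", "__index_level_0__"])
        .rename(columns={"instruction": "question", "response": "target"})
    )


toy = pd.DataFrame(
    {
        "instruction": ["q1", "q2", "q3"],
        "context": ["", "some context", ""],
        "response": ["a1", "a2", "a3"],
        "category": ["open_qa", "closed_qa", "open_qa"],
        "__index_level_0__": [10, 11, 12],
    }
)
open_qa = preprocess_open_qa(toy)
print(open_qa)
```

Because no index reset is listed for this dataset, the filtered frame keeps the original row labels.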
-
Lang: EN
Rows: 52002
Preprocess:
1. Select the `train` split.
2. Drop columns `input`, `text`.
3. Rename column `instruction` to `question`.
4. Rename column `output` to `target`.
-
Lang: EN
Rows: 188
Preprocess:
1. Select the `test` split.
2. Drop columns `context`, `category`, `text`.
3. Rename column `instruction` to `question`.
4. Rename column `response` to `target`.
Metrics OPEN QA
BLEU
ROUGE