Datasets and Evaluation Metrics
The provided fine-tuning script lets you select between three datasets by passing the dataset argument to the llama_finetuning.py script. The current options are grammar_dataset, alpaca_dataset and samsum_dataset. Note: Use of any of the datasets should be in compliance with the dataset's underlying license (including but not limited to non-commercial use).


grammar_dataset contains 150K pairs of English sentences and possible corrections.

alpaca_dataset provides 52K instruction-response pairs as generated by text-davinci-003.

samsum_dataset contains about 16k messenger-like conversations with summaries.


Adding custom datasets
The list of available datasets can easily be extended with custom datasets by following these instructions.
Each dataset has a corresponding configuration (dataclass) in configs/datasets.py which contains the dataset name, the training/validation split names, and optional parameters such as data files.
Additionally, there is a preprocessing function for each dataset in the ft_datasets folder.
The data returned by the dataset must be consumable by the forward method of the fine-tuned model, i.e. it is passed to the model by calling model(**data).
For CausalLM models this usually means that the data needs to be in the form of a dictionary with "input_ids", "attention_mask" and "labels" fields.
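
As a rough illustration of that dictionary format, here is a minimal sketch of how one training example might be assembled. The helper build_causal_lm_example and the token ids are hypothetical and not part of the repository; masking prompt tokens with -100 is a common convention (it is the default ignore_index of PyTorch's CrossEntropyLoss), assumed here rather than taken from the fine-tuning script. In practice the lists would be converted to tensors before being passed to model(**data).

```python
# Hypothetical helper for illustration; it is not part of the repository.
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_causal_lm_example(prompt_ids, response_ids):
    """Return a dict with the three fields a CausalLM forward pass usually expects."""
    input_ids = list(prompt_ids) + list(response_ids)
    return {
        "input_ids": input_ids,
        # 1 marks a real token; padding (none in this sketch) would be marked with 0.
        "attention_mask": [1] * len(input_ids),
        # Assumption: the loss is computed on the response only,
        # so the prompt positions are masked out with IGNORE_INDEX.
        "labels": [IGNORE_INDEX] * len(prompt_ids) + list(response_ids),
    }


# Made-up token ids standing in for a tokenized prompt and response.
example = build_causal_lm_example(prompt_ids=[1, 15, 27], response_ids=[42, 7, 2])
print(example)
```

All three lists have the same length, which is what lets the model align each label with an input position.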
To add a custom dataset, perform the following steps.

1. Create a dataset configuration following the schema described above. Examples can be found in configs/datasets.py.
2. Create a preprocessing routine which loads the data and returns a PyTorch style dataset. The signature for the preprocessing function needs to be (dataset_config, tokenizer, split_name) where split_name will be the string for the train/validation split as defined in the dataclass.
3. Register the dataset name and preprocessing function by inserting them as key and value into the DATASET_PREPROC dictionary in utils/dataset_utils.py.
4. Set the dataset field in the training config to the dataset name, or use the --dataset option of the llama_finetuning.py training script.
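
The steps above can be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the repository's code: MyCustomDatasetConfig, its field names, ToyTokenizer, MyCustomDataset, get_my_custom_dataset and the in-memory data are all hypothetical, and the DATASET_PREPROC dictionary here is a local stand-in for the one in utils/dataset_utils.py. To stay runnable without extra dependencies, the dataset is a plain class with __len__ and __getitem__; a real implementation would typically subclass torch.utils.data.Dataset and use the tokenizer passed in by the training script.

```python
from dataclasses import dataclass


@dataclass
class MyCustomDatasetConfig:
    """Step 1: hypothetical configuration; the field names are assumptions."""
    dataset: str = "my_custom_dataset"
    train_split: str = "train"
    test_split: str = "validation"


class ToyTokenizer:
    """Stand-in tokenizer that assigns each new word the next integer id."""

    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        return [self.vocab.setdefault(word, len(self.vocab)) for word in text.split()]


# Made-up data keyed by split name; a real routine would load files or a hub dataset.
TOY_DATA = {
    "train": ["fix this sentence", "summarize this chat"],
    "validation": ["fix this chat"],
}


class MyCustomDataset:
    """Map-style dataset: supports len() and integer indexing."""

    def __init__(self, dataset_config, tokenizer, split_name):
        self.examples = [tokenizer.encode(text) for text in TOY_DATA[split_name]]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, index):
        ids = self.examples[index]
        return {
            "input_ids": ids,
            "attention_mask": [1] * len(ids),
            # Assumption: labels mirror input_ids, as is common for CausalLM training.
            "labels": list(ids),
        }


def get_my_custom_dataset(dataset_config, tokenizer, split_name):
    """Step 2: preprocessing routine with the required signature."""
    return MyCustomDataset(dataset_config, tokenizer, split_name)


# Step 3: local stand-in for the DATASET_PREPROC dictionary.
DATASET_PREPROC = {}
DATASET_PREPROC["my_custom_dataset"] = get_my_custom_dataset

# Step 4 equivalent: look up the routine by the configured dataset name.
config = MyCustomDatasetConfig()
train_set = DATASET_PREPROC[config.dataset](config, ToyTokenizer(), config.train_split)
print(len(train_set), train_set[0])
```

Keeping the lookup keyed on the dataset name means the training script only needs the name from the config or the --dataset option to find the matching preprocessing routine.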


Application
Below we list other datasets and their main use cases that can be used for fine tuning.

Q&A (these can be used for evaluation as well)

MMLU
BoolQ
NarrativeQA
NaturalQuestions (closed-book)
NaturalQuestions (open-book)
QuAC
HellaSwag
OpenbookQA
TruthfulQA (can be helpful for checking the model's factuality and its tendency to produce misinformation)


Instruction fine-tuning

Dataset	Size	Use case
Alpaca	52k	instruction tuning
Dolly 15k	15k	instruction tuning

Simple text generation for quick tests

Dataset	Size	Use case
English quotes	2508	multi-label text classification, text generation

Reasoning (used mostly for evaluation of LLMs)

bAbI
Dyck
GSM8K
MATH
APPS
HumanEval
LSAT
Entity matching


Toxicity evaluation

Real_toxic_prompts


Bias evaluation


Crows_pair (gender bias)
WinoGender (gender bias)


Useful Links
More information on evaluation datasets can be found in HELM.