
Datasets and Evaluation Metrics

The provided fine-tuning script lets you choose among three datasets by passing the dataset argument to the llama_finetuning.py script. The current options are grammar_dataset, alpaca_dataset and samsum_dataset. Note: Use of any of these datasets must comply with the dataset's underlying license (including, but not limited to, restrictions to non-commercial use).

  • grammar_dataset contains 150K pairs of English sentences and possible corrections.
  • alpaca_dataset provides 52K instruction-response pairs as generated by text-davinci-003.
  • samsum_dataset contains about 16k messenger-like conversations with summaries.

Adding custom datasets

The list of available datasets can easily be extended with custom datasets by following these instructions.

Each dataset has a corresponding configuration (dataclass) in configs/datasets.py which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
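As a rough illustration of such a configuration, the sketch below defines a dataclass for a hypothetical dataset. The class name my_custom_dataset, the field names and the file path are assumptions made for this example; check the existing entries in configs/datasets.py for the exact fields the training code expects.

```python
from dataclasses import dataclass


@dataclass
class my_custom_dataset:
    # All names and values here are hypothetical; mirror the fields used by
    # the existing dataclasses in configs/datasets.py.
    dataset: str = "my_custom_dataset"      # name used to select the dataset
    train_split: str = "train"              # split name for training
    test_split: str = "validation"          # split name for validation
    data_path: str = "data/my_custom.json"  # optional parameter (assumed)


config = my_custom_dataset()
print(config.dataset, config.train_split, config.test_split)
```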

Additionally, there is a preprocessing function for each dataset in the ft_datasets folder. The returned data of the dataset needs to be consumable by the forward method of the fine-tuned model by calling model(**data). For CausalLM models this usually means that the data needs to be in the form of a dictionary with "input_ids", "attention_mask" and "labels" fields.
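To make the expected sample format concrete, here is a minimal sketch that builds one such dictionary from already-tokenized ids. The helper build_example, the padding id 0 and the choice to mask the prompt are assumptions for illustration, not the repository's implementation. It uses plain Python lists so that it runs without extra dependencies; a real preprocessing function would typically return tensors. The value -100 is the default index ignored by PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_example(prompt_ids, answer_ids, max_len, pad_id=0):
    """Build one sample dict from token ids (hypothetical helper)."""
    # Concatenate prompt and answer, truncating to the maximum length.
    input_ids = (prompt_ids + answer_ids)[:max_len]
    # Mask the prompt so that the loss is computed on the answer only.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + answer_ids)[:max_len]
    attention_mask = [1] * len(input_ids)
    # Pad every field to max_len; padded positions are not attended to
    # and do not contribute to the loss.
    pad = max_len - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "attention_mask": attention_mask + [0] * pad,
        "labels": labels + [IGNORE_INDEX] * pad,
    }


sample = build_example(prompt_ids=[5, 6], answer_ids=[7, 8, 9], max_len=8)
print(sample)
```

Masking the prompt is one design choice among several; setting the labels equal to the input ids (training on the full sequence) is also common.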

To add a custom dataset the following steps need to be performed.

  1. Create a dataset configuration following the schema described above. Examples can be found in configs/datasets.py.
  2. Create a preprocessing routine which loads the data and returns a PyTorch-style dataset. The signature of the preprocessing function needs to be (dataset_config, tokenizer, split_name), where split_name is the string for the train/validation split as defined in the dataclass.
  3. Register the dataset name and preprocessing function by inserting them as key and value into the DATASET_PREPROC dictionary in utils/dataset_utils.py.
  4. Set the dataset field in the training config to the dataset name, or use the --dataset option of the llama_finetuning.py training script.
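The steps above can be sketched end to end as follows. Everything in this example is a stand-in: toy_dataset, ToyTokenizer, RAW, ToyDataset and get_toy_dataset are hypothetical names, and the local DATASET_PREPROC dictionary only imitates the one in utils/dataset_utils.py. A real routine would load actual data and use the model's tokenizer. The dataset class implements only __len__ and __getitem__, which is what a map-style PyTorch dataset requires.

```python
from dataclasses import dataclass


# Step 1: dataset configuration (hypothetical fields).
@dataclass
class toy_dataset:
    dataset: str = "toy_dataset"
    train_split: str = "train"
    test_split: str = "validation"


class ToyTokenizer:
    """Stand-in tokenizer: maps each whitespace-separated word to an integer id."""

    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        return [self.vocab.setdefault(w, len(self.vocab) + 1) for w in text.split()]


# In-memory data standing in for files on disk.
RAW = {
    "train": ["hello world", "fine tuning is fun"],
    "validation": ["hello again"],
}


class ToyDataset:
    """Map-style dataset: provides __len__ and __getitem__."""

    def __init__(self, texts, tokenizer):
        self.samples = []
        for text in texts:
            ids = tokenizer.encode(text)
            self.samples.append(
                {"input_ids": ids, "attention_mask": [1] * len(ids), "labels": list(ids)}
            )

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


# Step 2: preprocessing routine with the required signature.
def get_toy_dataset(dataset_config, tokenizer, split_name):
    return ToyDataset(RAW[split_name], tokenizer)


# Step 3: register name -> preprocessing function (local stand-in dictionary).
DATASET_PREPROC = {"toy_dataset": get_toy_dataset}

# Step 4: select the dataset by name, as the training script would.
config = toy_dataset()
train_set = DATASET_PREPROC[config.dataset](config, ToyTokenizer(), config.train_split)
print(len(train_set), train_set[0])
```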

Application

Below we list other datasets that can be used for fine-tuning, along with their main use cases.

Q&A (these can be used for evaluation as well)

instruction finetuning

  • Alpaca: 52k samples, instruction tuning
  • Dolly: 15k samples, instruction tuning

simple text generation for quick tests

  • English quotes: 2508 samples, multi-label text classification and text generation

Reasoning used mostly for evaluation of LLMs

Toxicity evaluation

Bias evaluation

Useful Links

More information on evaluation datasets can be found in HELM.