Commit 71e77969 authored by Matthias Reso's avatar Matthias Reso

Add documentation for oaast1 dataset

parent dc507b4e
@@ -101,7 +101,7 @@ If you want to dive right into single or multi GPU fine-tuning, run the examples
All the parameters in the examples and recipes below need to be further tuned to achieve the desired results for the model, method, data, and task at hand.
**Note:**
* To change the dataset in the commands below, pass the `dataset` arg. Current options for integrated datasets are `grammar_dataset`, `alpaca_dataset`, and `samsum_dataset`. Additionally, we integrate the OpenAssistant/oasst1 dataset as an [example for a custom dataset](./examples/custom_dataset.py). A description of how to use your own dataset and how to add custom datasets can be found in [Dataset.md](./docs/Dataset.md#using-custom-datasets). For `grammar_dataset` and `alpaca_dataset`, please make sure you use the suggested instructions from [here](./docs/single_gpu.md#how-to-run-with-different-datasets) to set them up.
* The default dataset and other LoRA configs are set to `samsum_dataset`.
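As an illustration, passing the `dataset` arg to switch datasets might look like the following single-GPU LoRA run. This is a sketch: the flag names follow common llama-recipes conventions but may differ between versions, and the model name and output directory are placeholders — check `examples/finetuning.py --help` for the flags your checkout actually supports.

```shell
# Hypothetical invocation; verify flag names against your llama-recipes version.
python -m llama_recipes.finetuning \
  --use_peft \
  --peft_method lora \
  --model_name meta-llama/Llama-2-7b-hf \
  --dataset grammar_dataset \
  --output_dir ./peft_output
```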
...
# Datasets and Evaluation Metrics
The provided fine-tuning script allows you to select between three datasets by passing the `dataset` arg to the `llama_recipes.finetuning` module or the `examples/finetuning.py` script. The current options are `grammar_dataset`, `alpaca_dataset`, and `samsum_dataset`. Additionally, we integrate the OpenAssistant/oasst1 dataset as an [example for a custom dataset](./examples/custom_dataset.py). Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses).
* [grammar_dataset](https://huggingface.co/datasets/jfleg) contains 150K pairs of English sentences and possible corrections.
* [alpaca_dataset](https://github.com/tatsu-lab/stanford_alpaca) provides 52K instruction-response pairs as generated by `text-davinci-003`.
* [samsum_dataset](https://huggingface.co/datasets/samsum) contains about 16k messenger-like conversations with summaries.
* [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1/) contains about 88k messages from assistant-style conversations.
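A custom dataset like oasst1 typically needs its conversation threads flattened into training text. The sketch below shows one way that could look: the `role` and `text` field names follow the oasst1 schema, but the chat template itself is an assumption for illustration, not the one used by `examples/custom_dataset.py`.

```python
# Illustrative only: the User/Assistant template below is an assumption,
# not the formatting applied by examples/custom_dataset.py.
def format_thread(messages):
    """Concatenate an oasst1-style prompter/assistant thread into one string."""
    parts = []
    for msg in messages:
        # oasst1 uses "prompter" for the human side of the conversation.
        speaker = "User" if msg["role"] == "prompter" else "Assistant"
        parts.append(f"{speaker}: {msg['text']}")
    return "\n".join(parts)

thread = [
    {"role": "prompter", "text": "How do lasers work?"},
    {"role": "assistant", "text": "Lasers amplify light by stimulated emission."},
]
print(format_thread(thread))
```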
## Using custom datasets
...