From 71e779694f4abf3f4fc1b596cf4c110e9bfaf627 Mon Sep 17 00:00:00 2001
From: Matthias Reso <13337103+mreso@users.noreply.github.com>
Date: Fri, 8 Sep 2023 18:26:16 +0000
Subject: [PATCH] Add documentation for oasst1 dataset

---
 README.md       | 2 +-
 docs/Dataset.md | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index f1a1dac2..cc360820 100644
--- a/README.md
+++ b/README.md
@@ -101,7 +101,7 @@ If you want to dive right into single or multi GPU fine-tuning, run the examples
 All the parameters in the examples and recipes below need to be further tuned to have desired results based on the model, method, data and task at hand.
 
 **Note:**
-* To change the dataset in the commands below pass the `dataset` arg. Current options for integrated dataset are `grammar_dataset`, `alpaca_dataset`and  `samsum_dataset`. A description of how to use your own dataset and how to add custom datasets can be found in [Dataset.md](./docs/Dataset.md#using-custom-datasets). For  `grammar_dataset`, `alpaca_dataset` please make sure you use the suggested instructions from [here](./docs/single_gpu.md#how-to-run-with-different-datasets) to set them up.
+* To change the dataset in the commands below, pass the `dataset` arg. Current options for integrated datasets are `grammar_dataset`, `alpaca_dataset` and `samsum_dataset`. Additionally, we integrate the OpenAssistant/oasst1 dataset as an [example for a custom dataset](./examples/custom_dataset.py). A description of how to use your own dataset and how to add custom datasets can be found in [Dataset.md](./docs/Dataset.md#using-custom-datasets). For `grammar_dataset` and `alpaca_dataset`, please make sure you use the suggested instructions from [here](./docs/single_gpu.md#how-to-run-with-different-datasets) to set them up.
 
 * Default dataset and other LORA config has been set to `samsum_dataset`.
 
diff --git a/docs/Dataset.md b/docs/Dataset.md
index cc17be7e..e48fd7c1 100644
--- a/docs/Dataset.md
+++ b/docs/Dataset.md
@@ -1,10 +1,11 @@
 # Datasets and Evaluation Metrics
 
-The provided fine tuning script allows you to select between three datasets by passing the `dataset` arg to the `llama_recipes.finetuning` module or `examples/finetuning.py` script. The current options are `grammar_dataset`, `alpaca_dataset`and `samsum_dataset`. Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses)
+The provided fine-tuning script allows you to select between three datasets by passing the `dataset` arg to the `llama_recipes.finetuning` module or `examples/finetuning.py` script. The current options are `grammar_dataset`, `alpaca_dataset` and `samsum_dataset`. Additionally, we integrate the OpenAssistant/oasst1 dataset as an [example for a custom dataset](../examples/custom_dataset.py). Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses).
 
 * [grammar_dataset](https://huggingface.co/datasets/jfleg) contains 150K pairs of english sentences and possible corrections.
 * [alpaca_dataset](https://github.com/tatsu-lab/stanford_alpaca) provides 52K instruction-response pairs as generated by `text-davinci-003`.
 * [samsum_dataset](https://huggingface.co/datasets/samsum) contains about 16k messenger-like conversations with summaries.
+* [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1/) contains about 88k messages from assistant-style conversations.
 
 ## Using custom datasets
 
-- 
GitLab