The provided fine tuning script allows you to select between three datasets by passing the `dataset` argument.
* [samsum_dataset](https://huggingface.co/datasets/samsum) contains about 16k messenger-like conversations with summaries.
* [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1/) contains about 88k messages from assistant-style conversations.
## Batching Strategies
Llama-recipes supports two strategies for batching samples together during training.
The default setting is `packing` which concatenates the tokenized samples into long sequences filling up the context length of the model.
This is the most compute-efficient variant as it avoids any padding and all sequences have the same length.
Samples at the boundary of the context length are truncated and the remainder of the cut sequence is used as the start of the next long sequence.
If the amount of training data is small, this procedure might introduce a lot of noise into the training data, which can hurt the prediction performance of the fine-tuned model.
Therefore, we also support a `padding` strategy which does not introduce the additional noise due to truncated sequences.
The strategy tries to minimize the efficiency loss by batching samples of similar length together so only minimal padding is necessary.
The batching strategy can be selected through the command line parameter `--batching_strategy [packing]/[padding]`.
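
To make the packing idea concrete, here is a simplified sketch (not the actual llama-recipes implementation; the function name and the context length value are placeholders): all tokenized samples are concatenated into one token stream and cut into context-length chunks, so no padding tokens are needed.

```python
from itertools import chain

def pack_samples(tokenized_samples, context_length=4096):
    """Concatenate tokenized samples and cut the stream into fixed-size chunks."""
    stream = list(chain.from_iterable(tokenized_samples))
    # Every chunk has exactly context_length tokens; a sample crossing a chunk
    # boundary is simply split, which is the source of the noise mentioned above.
    # A trailing partial chunk is dropped in this sketch.
    return [
        stream[i : i + context_length]
        for i in range(0, len(stream) - context_length + 1, context_length)
    ]
```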
## Using custom datasets
The list of available datasets in llama-recipes is supposed to give users a quick start on training their Llama model.
...
The `dataset_config` in the above signature will be an instance of the llama_recipes configuration for the custom dataset.
The split signals whether to return the training or validation dataset.
The default function name is `get_custom_dataset` but this can be changed as described below.
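
As a minimal sketch of such a dataset function (the samsum data, prompt format, and column handling here are illustrative assumptions, not the example shipped with llama-recipes), it receives the `dataset_config`, `tokenizer`, and `split` arguments discussed above and returns a tokenized dataset:

```python
import datasets

def get_custom_dataset(dataset_config, tokenizer, split):
    # `split` is either the training or the validation split name.
    dataset = datasets.load_dataset("samsum", split=split)

    def tokenize(sample):
        # Build a simple prompt/response string and tokenize it.
        text = (
            f"Summarize this dialog:\n{sample['dialogue']}\n---\n"
            f"Summary:\n{sample['summary']}{tokenizer.eos_token}"
        )
        return tokenizer(text)

    # Keep only the tokenized columns expected by the training loop.
    return dataset.map(tokenize, remove_columns=list(dataset.features))
```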
In order to start a training run with the custom dataset we need to set the `--dataset` as well as the `--custom_dataset.file` parameter.
If the `--custom_dataset.file` parameter points to a function other than the default, e.g. `get_foo`, that function will be called instead of `get_custom_dataset` when retrieving the dataset.
### Adding new dataset
Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../src/llama_recipes/configs/datasets.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
Additionally, there is a preprocessing function for each dataset in the [datasets](../src/llama_recipes/datasets) folder.
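
For illustration, a configuration for a hypothetical new dataset could look like the following sketch; the field names mirror the pattern described above, but the exact fields should be checked against the existing dataclasses in configs/datasets.py:

```python
from dataclasses import dataclass

@dataclass
class my_new_dataset:
    dataset: str = "my_new_dataset"   # name passed to the --dataset parameter
    train_split: str = "train"        # split used for training
    test_split: str = "validation"    # split used for evaluation
    data_path: str = ""               # optional path to local data files
```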