diff --git a/docs/Dataset.md b/docs/Dataset.md
index 034b00a5700394a073e11ac76d8e96d40de1f83f..4f31907cd8762f824c223e4995c5e53af8225d52 100644
--- a/docs/Dataset.md
+++ b/docs/Dataset.md
@@ -10,10 +10,10 @@ The provided fine tuning script allows you to select between three datasets by p
 
 ## Batching Strategies
 Llama-recipes support two strategies to batch requests together.
 The default setting is `packing` which concatenates the tokenized samples into long sequences filling up the context length of the model.
-This is the most compute efficient variant as it avoids any padding and all sequences have the same langth.
+This is the most compute efficient variant as it avoids any padding and all sequences have the same length.
 Samples at the boundary of the context length are truncated and the remainder of the cut sequence it used as the start of the next long sequence.
-If the amount of training data is small this proceedure might introduce a lot of noise into the training data which can hurt the prediction performance of the fine-tune model.
+If the amount of training data is small this procedure might introduce a lot of noise into the training data which can hurt the prediction performance of the fine-tune model.
 Therefore, we also support a `padding` strategy which does not introduce the addition noise due to truncated sequences.
 The strategy tries to minimize the efficiency loss by batching samples of similar length together so only minimal padding is necessary.
 
diff --git a/scripts/spellcheck_conf/wordlist.txt b/scripts/spellcheck_conf/wordlist.txt
index f58d04252cc415578e16d5957c1ed90a79ef42dd..27b3286f72dcd26dfba95a9a76de8382fd77389f 100644
--- a/scripts/spellcheck_conf/wordlist.txt
+++ b/scripts/spellcheck_conf/wordlist.txt
@@ -1156,4 +1156,5 @@ Autocast
 FN
 GBs
 MLP
-learnable
\ No newline at end of file
+learnable
+tokenized
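
For reference, a minimal sketch of the `packing` strategy described in the documentation change above: tokenized samples are concatenated into one token stream and then sliced into sequences of the model's context length, so a sample cut at a boundary simply continues at the start of the next sequence. This is an illustrative assumption, not the llama-recipes implementation; the names `pack_samples`, `tokenized_samples`, and `context_length` are hypothetical.

```python
from itertools import chain

def pack_samples(tokenized_samples, context_length):
    """Illustrative packing: concatenate tokenized samples and split the
    resulting token stream into fixed-size chunks of `context_length`.
    Not the llama-recipes implementation; a sketch of the idea only."""
    # Flatten all samples into one long token stream.
    stream = list(chain.from_iterable(tokenized_samples))
    # Slice the stream into equal-length sequences; a sample truncated at a
    # chunk boundary continues as the start of the next chunk. Leftover
    # tokens that do not fill a full chunk are dropped in this sketch.
    return [
        stream[i : i + context_length]
        for i in range(0, len(stream) - context_length + 1, context_length)
    ]

# Example: pack three short "tokenized" samples into sequences of length 4.
samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
print(pack_samples(samples, context_length=4))
# [[1, 2, 3, 4], [5, 6, 7, 8]]  (tokens 9 and 10 do not fill a chunk here)
```

Because every packed sequence has the same length, no padding tokens are needed; the `padding` strategy mentioned in the doc instead keeps samples intact and batches similarly sized ones together to keep padding minimal.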