Commit afb3b758 authored by Matthias Reso

Add 405B + QLoRA + FSDP to multi_gpu.md doc

parent 939c88fb
@@ -6,13 +6,12 @@ To run fine-tuning on multi-GPUs, we will make use of two packages:
2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](LLM_finetuning.md/#2-full-partial-parameter-finetuning).
Given the combination of PEFT and FSDP, we would be able to fine-tune a Meta Llama 8B model on multiple GPUs in one node.
For big models like the 405B we will need to fine-tune in a multi-node setup, even if 4-bit quantization is enabled.
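For orientation, a minimal single-node sketch combining FSDP and LoRA for the 8B model could look as follows (the model path and output directory are placeholders, not paths shipped with the repo):

```bash
# Minimal sketch: FSDP + LoRA fine-tuning of an 8B model on 4 GPUs in one node.
# /path_of_model_folder/8B and the output directory are placeholders -- adjust to your setup.
torchrun --nnodes 1 --nproc_per_node 4 recipes/quickstart/finetuning/finetuning.py \
  --enable_fsdp \
  --use_peft --peft_method lora \
  --model_name /path_of_model_folder/8B \
  --output_dir Path/to/save/PEFT/model
```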
## Requirements
To run the examples, make sure to install the llama-recipes package and clone the GitHub repository in order to use the provided [`finetuning.py`](../recipes/quickstart/finetuning/finetuning.py) script with torchrun (see [README.md](../README.md) for details).
**Please note that the llama_recipes package will install PyTorch 2.0.1; if you want to run FSDP + PEFT, please make sure to install the PyTorch nightlies.**
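As a rough sketch of that setup (the repository URL below is an assumption; use the location you obtained llama-recipes from):

```bash
# Install the package and clone the repository to get the finetuning script.
# The repository URL is an assumption -- substitute your actual source if it differs.
pip install llama-recipes
git clone https://github.com/meta-llama/llama-recipes.git
cd llama-recipes
```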
## How to run it
Get access to a machine with multiple GPUs (in this case we tested with 4 A100s and A10s).
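As a quick sanity check, you can confirm that the expected GPUs are visible before launching, for example:

```bash
# List the GPUs visible on the machine; the count should match --nproc_per_node below.
nvidia-smi --list-gpus
```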
@@ -61,7 +60,7 @@ torchrun --nnodes 1 --nproc_per_node 8 recipes/quickstart/finetuning/finetuning
This has been tested on 4 H100 GPUs.
```bash
FSDP_CPU_RAM_EFFICIENT_LOADING=1 ACCELERATE_USE_FSDP=1 torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --quantization 4bit --model_name /path_of_model_folder/70B --mixed_precision False --low_cpu_fsdp --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
```
### Fine-tuning using FSDP on 70B Model
@@ -85,6 +84,22 @@ sbatch recipes/quickstart/finetuning/multi_node.slurm
```
To fine-tune the Meta Llama 405B model with LoRA on 32x H100 80 GB GPUs, we need to combine 4-bit quantization (QLoRA) and FSDP.
We can achieve this by adding the following environment variables to the slurm script (before the srun command at the bottom).
```bash
export FSDP_CPU_RAM_EFFICIENT_LOADING=1
export ACCELERATE_USE_FSDP=1
```
Then we need to replace the srun command at the bottom with the following:
```bash
srun torchrun --nproc_per_node 8 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:29500 ./finetuning.py --enable_fsdp --use_peft --peft_method lora --quantization 4bit --quantization_config.quant_type nf4 --mixed_precision False --low_cpu_fsdp
```
Do not forget to adjust the number of nodes, ntasks and gpus-per-task at the top of the script.
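For illustration only (the actual multi_node.slurm shipped in the repo may differ), the relevant header directives for 4 nodes with 8 H100s each, together with the added environment variables, could look roughly like this:

```bash
#!/bin/bash
# Hypothetical SBATCH header sketch for 4 nodes x 8 GPUs = 32 H100s;
# adjust nodes, ntasks and gpus-per-task to match your cluster.
#SBATCH --job-name=llama-405b-qlora
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --gpus-per-task=8

export FSDP_CPU_RAM_EFFICIENT_LOADING=1
export ACCELERATE_USE_FSDP=1
```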
## How to run with different datasets?
Currently, four datasets are supported; they can be found in the [Datasets config file](../src/llama_recipes/configs/datasets.py).
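As a sketch (the dataset name below is taken from the datasets config and may change between versions), a different dataset can be selected with the `--dataset` flag:

```bash
# Sketch: selecting a different dataset via --dataset.
# Dataset names come from src/llama_recipes/configs/datasets.py; model path and output dir are placeholders.
torchrun --nnodes 1 --nproc_per_node 4 recipes/quickstart/finetuning/finetuning.py \
  --enable_fsdp --use_peft --peft_method lora \
  --model_name /path_of_model_folder/8B \
  --dataset alpaca_dataset \
  --output_dir Path/to/save/PEFT/model
```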