Commit cfacee43 authored by Hamid Shojanazeri, committed by Matthias Reso

Update LLM_finetuning.md

parent 62dd2b3f
@@ -46,7 +46,7 @@ In this scenario depending on the model size, you might need to go beyond one GPU
The way to think about it is that you need enough GPU memory to hold the model parameters, gradients, and optimizer states. Each of these, depending on the precision you train in, can take up multiples of your parameter count x bytes per parameter (fp32 is 4 bytes, fp16 is 2 bytes, bf16 is 2 bytes).
For example, the AdamW optimizer keeps two extra states for each of your parameters, and in many cases these are kept in fp32. This means that, depending on how many layers you are training/unfreezing, the required memory can grow beyond a single GPU.
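A rough back-of-the-envelope calculation makes this concrete; the sketch below assumes a 7B-parameter model trained in half precision with AdamW (the parameter count and byte sizes are illustrative assumptions, not figures from the original text):

```python
# Rough memory estimate for full fine-tuning with AdamW in mixed precision.
# The 7B parameter count is an illustrative assumption; activations and
# temporary buffers come on top of this.
params = 7e9                       # e.g. a 7B-parameter model

weights_bytes   = params * 2       # bf16/fp16 weights: 2 bytes per parameter
gradients_bytes = params * 2       # gradients kept in the same half precision
optimizer_bytes = params * 4 * 2   # AdamW: two fp32 states (momentum, variance)

total_gb = (weights_bytes + gradients_bytes + optimizer_bytes) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~84 GB for a 7B model
```

Even before counting activations, this is well beyond the memory of a single common GPU, which is why the training state has to be sharded across devices.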
-**FSDP (FUlly Sharded Data Parallel)**
+**FSDP (Fully Sharded Data Parallel)**
PyTorch provides the FSDP package for training models that do not fit into one GPU. FSDP lets you train a much larger model with the same amount of resources. Prior to FSDP there was DDP (Distributed Data Parallel), where each GPU holds a full replica of the model and only the data is sharded; at the end of the backward pass the gradients are synced across GPUs.
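As a minimal sketch of what wrapping a model in FSDP can look like (the model name, environment variables, and launch command are assumptions for illustration, not taken from the original document):

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

# Assumes a multi-GPU host and a launcher such as:
#   torchrun --nproc_per_node=8 train.py
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Illustrative model name; any large causal LM is wrapped the same way.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Unlike DDP, which keeps a full replica on every GPU, FSDP shards the
# parameters, gradients, and optimizer states across the ranks.
model = FSDP(model, device_id=local_rank)
```

Because each rank only materializes the full parameters for the part of the model it is currently computing, the per-GPU memory footprint drops, which is what makes training models larger than a single GPU's memory possible.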