Commit 190d543b authored by Matthias Reso

Add fp8 references

parent c1679454
@@ -30,7 +30,8 @@
 The script will ask for another prompt in a loop after completing the generation.
 When using multiple GPUs the model will automatically be split across the available GPUs using tensor parallelism.
 ## Multi-node multi-gpu inference
-Models like the unquantized variant of Meta-Llama-3.1-405B are too large to be executed on a single node and therefore need multi-node inference.
+The FP8 quantized variants of Meta Llama (i.e. meta-llama/Meta-Llama-3.1-405B-FP8 and meta-llama/Meta-Llama-3.1-405B-Instruct-FP8) can be executed on a single node with 8x80GB H100 using the script located in this folder.
+To run the unquantized Meta Llama 405B variants (i.e. meta-llama/Meta-Llama-3.1-405B and meta-llama/Meta-Llama-3.1-405B-Instruct) we need multi-node inference.
 vLLM allows this by leveraging pipeline parallelism across nodes while still applying tensor parallelism inside each node.
 To start multi-node inference we first need to set up a Ray cluster, which will be leveraged by vLLM to execute the model across node boundaries.
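For reference, a minimal sketch of driving this multi-node setup through vLLM's Python API. It assumes a Ray cluster spanning two 8-GPU nodes has already been started with `ray start`; the model name, parallelism sizes, and prompt are illustrative and are not part of the committed scripts:

```python
from vllm import LLM, SamplingParams

# Sketch: pipeline parallelism across 2 nodes, tensor parallelism across the
# 8 GPUs inside each node. Assumes a Ray cluster connecting both nodes has
# already been started (e.g. `ray start --head` on the head node and
# `ray start --address='<head-ip>:6379'` on the worker node).
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",
    tensor_parallel_size=8,              # GPUs per node
    pipeline_parallel_size=2,            # number of nodes
    distributed_executor_backend="ray",  # multi-node execution goes through Ray
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What is the capital of France?"], sampling_params)
print(outputs[0].outputs[0].text)
```

Keeping tensor parallelism inside a node and pipeline parallelism between nodes follows the split described above: tensor parallelism needs the fast intra-node GPU interconnect, while pipeline stages tolerate the slower links between nodes.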
@@ -86,4 +86,5 @@ python inference.py --model_name <training_config.output_dir> --prompt_file <tes
 ```
 ## Inference on large models like Meta Llama 405B
-To run the Meta Llama 405B variant without quantization we need to use a multi-node setup for inference. The llama-recipes inference script currently does not allow multi-node inference. To run this model you can use vLLM with pipeline and tensor parallelism as shown in [this example](../../../3p_integrations/vllm/README.md).
\ No newline at end of file
+The FP8 quantized variants of Meta Llama (i.e. meta-llama/Meta-Llama-3.1-405B-FP8 and meta-llama/Meta-Llama-3.1-405B-Instruct-FP8) can be executed on a single node with 8x80GB H100 using the scripts located in this folder.
+To run the unquantized Meta Llama 405B variants (i.e. meta-llama/Meta-Llama-3.1-405B and meta-llama/Meta-Llama-3.1-405B-Instruct) we need to use a multi-node setup for inference. The llama-recipes inference script currently does not allow multi-node inference. To run this model you can use vLLM with pipeline and tensor parallelism as shown in [this example](../../../3p_integrations/vllm/README.md).
\ No newline at end of file
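For reference, a minimal sketch of the single-node case for the FP8 checkpoints mentioned above, assuming 8x80GB H100 GPUs are available on the node; this is an illustrative vLLM call, not the committed inference script itself:

```python
from vllm import LLM, SamplingParams

# Sketch: the FP8 checkpoint fits on a single 8x80GB H100 node, so tensor
# parallelism over the 8 local GPUs is sufficient and no Ray cluster is needed.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    tensor_parallel_size=8,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Write a short note on FP8 quantization."], sampling_params)
print(outputs[0].outputs[0].text)
```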