From 190d543b53f2c2bbfa54aad4ce26c4e9a2032f1a Mon Sep 17 00:00:00 2001
From: Matthias Reso <13337103+mreso@users.noreply.github.com>
Date: Mon, 22 Jul 2024 12:26:47 -0700
Subject: [PATCH] Add fp8 references

---
 recipes/3p_integrations/vllm/README.md                 | 3 ++-
 recipes/quickstart/inference/local_inference/README.md | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/recipes/3p_integrations/vllm/README.md b/recipes/3p_integrations/vllm/README.md
index 17d79785..396c7288 100644
--- a/recipes/3p_integrations/vllm/README.md
+++ b/recipes/3p_integrations/vllm/README.md
@@ -30,7 +30,8 @@ The script will ask for another prompt ina loop after completing the generation
 
 When using multiple gpus the model will automatically be split accross the available GPUs using tensor parallelism.
 
 ## Multi-node multi-gpu inference
-Models like the unquantized variant of Meta-Llama-3.1-405B are too large to be executed on an single node and therefore need multi-node inference.
+The FP8 quantized variants of Meta Llama (i.e. meta-llama/Meta-Llama-3.1-405B-FP8 and meta-llama/Meta-Llama-3.1-405B-Instruct-FP8) can be executed on a single node with 8x80GB H100 using the script located in this folder.
+To run the unquantized Meta Llama 405B variants (i.e. meta-llama/Meta-Llama-3.1-405B and meta-llama/Meta-Llama-3.1-405B-Instruct) we need multi-node inference.
 vLLM allows this by leveraging pipeline parallelism accros nodes while still applying tensor parallelism insid each node.
 To start a multi-node inference we first need to set up a ray serves which well be leveraged by vLLM to execute the model across node boundaries.
diff --git a/recipes/quickstart/inference/local_inference/README.md b/recipes/quickstart/inference/local_inference/README.md
index 2918d85e..a8afa076 100644
--- a/recipes/quickstart/inference/local_inference/README.md
+++ b/recipes/quickstart/inference/local_inference/README.md
@@ -86,4 +86,5 @@ python inference.py --model_name <training_config.output_dir> --prompt_file <tes
 ```
 
 ## Inference on large models like Meta Llama 405B
-To run the Meta Llama 405B variant without quantization we need to use a multi-node setup for inference. The llama-recipes inference script currently does not allow multi-node inference. To run this model you can use vLLM with pipeline and tensor parallelism as showed in [this example](../../../3p_integrations/vllm/README.md).
\ No newline at end of file
+The FP8 quantized variants of Meta Llama (i.e. meta-llama/Meta-Llama-3.1-405B-FP8 and meta-llama/Meta-Llama-3.1-405B-Instruct-FP8) can be executed on a single node with 8x80GB H100 using the scripts located in this folder.
+To run the unquantized Meta Llama 405B variants (i.e. meta-llama/Meta-Llama-3.1-405B and meta-llama/Meta-Llama-3.1-405B-Instruct) we need to use a multi-node setup for inference. The llama-recipes inference script currently does not allow multi-node inference. To run this model you can use vLLM with pipeline and tensor parallelism as shown in [this example](../../../3p_integrations/vllm/README.md).
\ No newline at end of file
-- 
GitLab
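
For context, a minimal sketch of the single-node FP8 setup the patch describes, using vLLM's offline Python API. This is illustrative only and not part of the patch; it assumes vLLM is installed and a single node with 8x80GB H100 GPUs is available, as stated in the README text.

```python
# Sketch: single-node inference with the FP8-quantized 405B variant via vLLM.
# Assumes one node with 8x80GB H100 GPUs; adjust settings for your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    tensor_parallel_size=8,  # split the model across the 8 GPUs on this node
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["What is the capital of France?"], sampling_params)
print(outputs[0].outputs[0].text)
```

The unquantized 405B variants do not fit on one such node; as the patch notes, they require multi-node inference with pipeline parallelism across nodes in addition to tensor parallelism within each node.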