diff --git a/UPDATES.md b/UPDATES.md index 9124ec65bf213b2d244d3aa600b6baa27db48b29..fcd4555928e52c3c8d8ee2b47e330ca39dcd2eec 100644 --- a/UPDATES.md +++ b/UPDATES.md @@ -1,19 +1,19 @@ ## System Prompt Update ### Observed Issue -We received feedback from the community on our prompt template and we are providing an update to reduce the false refusal rates seen. False refusals occur when the model incorrectly refuses to answer a question that it should, for example due to overly broad instructions to be cautious in how it provides responses. +We received feedback from the community on our prompt template and we are providing an update to reduce the false refusal rates seen. False refusals occur when the model incorrectly refuses to answer a question that it should, for example due to overly broad instructions to be cautious in how it provides responses. ### Updated approach -Based on evaluation and analysis, we recommend the removal of the system prompt as the default setting. Pull request [#626](https://github.com/facebookresearch/llama/pull/626) removes the system prompt as the default option, but still provides an example to help enable experimentation for those using it. +Based on evaluation and analysis, we recommend the removal of the system prompt as the default setting. Pull request [#626](https://github.com/facebookresearch/llama/pull/626) removes the system prompt as the default option, but still provides an example to help enable experimentation for those using it. ## Token Sanitization Update ### Observed Issue -The PyTorch scripts currently provided for tokenization and model inference allow for direct prompt injection via string concatenation. Prompt injections allow for the addition of special system and instruction prompt strings from user-provided prompts. +The PyTorch scripts currently provided for tokenization and model inference allow for direct prompt injection via string concatenation. Prompt injections allow for the addition of special system and instruction prompt strings from user-provided prompts. -As noted in the documentation, these strings are required to use the fine-tuned chat models. However, prompt injections have also been used for manipulating or abusing models by bypassing their safeguards, allowing for the creation of content or behaviors otherwise outside the bounds of acceptable use. +As noted in the documentation, these strings are required to use the fine-tuned chat models. However, prompt injections have also been used for manipulating or abusing models by bypassing their safeguards, allowing for the creation of content or behaviors otherwise outside the bounds of acceptable use. ### Updated approach -We recommend sanitizing [these strings](https://github.com/meta-llama/llama?tab=readme-ov-file#fine-tuned-chat-models) from any user provided prompts. Sanitization of user prompts mitigates malicious or accidental abuse of these strings. The provided scripts have been updated to do this. +We recommend sanitizing [these strings](https://github.com/meta-llama/llama?tab=readme-ov-file#fine-tuned-chat-models) from any user provided prompts. Sanitization of user prompts mitigates malicious or accidental abuse of these strings. The provided scripts have been updated to do this. -Note: even with this update safety classifiers should still be applied to catch unsafe behaviors or content produced by the model. An [example](./recipes/inference/local_inference/inference.py) of how to deploy such a classifier can be found in the llama-recipes repository. 
+Note: even with this update safety classifiers should still be applied to catch unsafe behaviors or content produced by the model. An [example](./recipes/quickstart/inference/local_inference/inference.py) of how to deploy such a classifier can be found in the llama-recipes repository. diff --git a/docs/FAQ.md b/docs/FAQ.md index 4229bedf8e511712698252515e619744e1d88c5d..6dc3fd91b40acf42015b7c78ef7300470ea8a039 100644 --- a/docs/FAQ.md +++ b/docs/FAQ.md @@ -16,7 +16,7 @@ Here we discuss frequently asked questions that may occur and we found useful al 4. Can I add custom datasets? - Yes, you can find more information on how to do that [here](../recipes/finetuning/datasets/README.md). + Yes, you can find more information on how to do that [here](../recipes/quickstart/finetuning/datasets/README.md). 5. What are the hardware SKU requirements for deploying these models? diff --git a/docs/multi_gpu.md b/docs/multi_gpu.md index 56913103309fb5ea5c3dfd92c2f0255e020b84cc..63ee9b97b22d64b5bd2daa5fc9440011a2929c42 100644 --- a/docs/multi_gpu.md +++ b/docs/multi_gpu.md @@ -9,7 +9,7 @@ To run fine-tuning on multi-GPUs, we will make use of two packages: Given the combination of PEFT and FSDP, we would be able to fine tune a Meta Llama 3 8B model on multiple GPUs in one node or multi-node. ## Requirements -To run the examples, make sure to install the llama-recipes package and clone the github repository in order to use the provided [`finetuning.py`](../recipes/finetuning/finetuning.py) script with torchrun (See [README.md](../README.md) for details). +To run the examples, make sure to install the llama-recipes package and clone the github repository in order to use the provided [`finetuning.py`](../recipes/quickstart/finetuning/finetuning.py) script with torchrun (See [README.md](../README.md) for details). **Please note that the llama_recipes package will install PyTorch 2.0.1 version, in case you want to run FSDP + PEFT, please make sure to install PyTorch nightlies.** diff --git a/recipes/inference/model_servers/README.md b/recipes/3p_integrations/README.md similarity index 100% rename from recipes/inference/model_servers/README.md rename to recipes/3p_integrations/README.md diff --git a/recipes/inference/model_servers/hf_text_generation_inference/README.md b/recipes/3p_integrations/hf_text_generation_inference/README.md similarity index 99% rename from recipes/inference/model_servers/hf_text_generation_inference/README.md rename to recipes/3p_integrations/hf_text_generation_inference/README.md index 7db1e00e5c444f48c858b2d036eff2ea6113dd46..0f794214cde030904636152f1b1a7fc13f56b103 100644 --- a/recipes/inference/model_servers/hf_text_generation_inference/README.md +++ b/recipes/3p_integrations/hf_text_generation_inference/README.md @@ -2,7 +2,7 @@ This document shows how to serve a fine tuned Llama mode with HuggingFace's text-generation-inference server. This option is currently only available for models that were trained using the LoRA method or without using the `--use_peft` argument. -## Step 0: Merging the weights (Only required if LoRA method was used) +## Step 0: Merging the weights (Only required if LoRA method was used) In case the model was fine tuned with LoRA method we need to merge the weights of the base model with the adapter weight. For this we can use the script `merge_lora_weights.py` which is located in the same folder as this README file. 
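For orientation, a minimal sketch of what that merge step typically does is shown below, assuming the Hugging Face `transformers` and `peft` packages; the model, adapter, and output paths are placeholders, and the provided `merge_lora_weights.py` script remains the supported way to do this.

```python
# Minimal sketch of folding LoRA adapter weights into the base model.
# Paths are placeholders; prefer the provided merge_lora_weights.py script.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_path = "meta-llama/Meta-Llama-3-8B"   # placeholder base model
adapter_path = "path/to/peft/adapter"            # placeholder LoRA adapter directory
output_dir = "path/to/merged/model"              # placeholder output directory

model = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model_path)

# Attach the adapter, then merge its low-rank updates into the base weights.
model = PeftModel.from_pretrained(model, adapter_path)
model = model.merge_and_unload()

model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```

The merged output directory can then be served by text-generation-inference like any regular Hugging Face checkpoint.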
@@ -40,9 +40,3 @@ curl 127.0.0.1:8080/generate_stream \ ``` Further information can be found in the documentation of the [hf text-generation-inference](https://github.com/huggingface/text-generation-inference) solution. - - - - - - diff --git a/recipes/inference/model_servers/hf_text_generation_inference/merge_lora_weights.py b/recipes/3p_integrations/hf_text_generation_inference/merge_lora_weights.py similarity index 100% rename from recipes/inference/model_servers/hf_text_generation_inference/merge_lora_weights.py rename to recipes/3p_integrations/hf_text_generation_inference/merge_lora_weights.py diff --git a/recipes/inference/model_servers/llama-on-prem.md b/recipes/3p_integrations/llama-on-prem.md similarity index 96% rename from recipes/inference/model_servers/llama-on-prem.md rename to recipes/3p_integrations/llama-on-prem.md index 7f8232cfe1542c7d894f5361d6d9b009b9c9e029..d43649a2299cf4fa77636ac418b20dbd4b0e94c2 100644 --- a/recipes/inference/model_servers/llama-on-prem.md +++ b/recipes/3p_integrations/llama-on-prem.md @@ -1,6 +1,6 @@ # Llama 3 On-Prem Inference Using vLLM and TGI -Enterprise customers may prefer to deploy Llama 3 on-prem and run Llama in their own servers. This tutorial shows how to use Llama 3 with [vLLM](https://github.com/vllm-project/vllm) and Hugging Face [TGI](https://github.com/huggingface/text-generation-inference), two leading open-source tools to deploy and serve LLMs, and how to create vLLM and TGI hosted Llama 3 instances with [LangChain](https://www.langchain.com/), an open-source LLM app development framework which we used for our other demo apps: [Getting to Know Llama](https://github.com/meta-llama/llama-recipes/blob/main/recipes/quickstart/Getting_to_know_Llama.ipynb), Running Llama 3 <!-- markdown-link-check-disable -->[locally](https://github.com/meta-llama/llama-recipes/blob/main/recipes/quickstart/Running_Llama3_Anywhere/Running_Llama_on_Mac_Windows_Linux.ipynb) <!-- markdown-link-check-disable --> and [in the cloud](https://github.com/meta-llama/llama-recipes/blob/main/recipes/use_cases/RAG/HelloLlamaCloud.ipynb). See [here](https://medium.com/@rohit.k/tgi-vs-vllm-making-informed-choices-for-llm-deployment-37c56d7ff705) for a detailed comparison of vLLM and TGI. +Enterprise customers may prefer to deploy Llama 3 on-prem and run Llama in their own servers. This tutorial shows how to use Llama 3 with [vLLM](https://github.com/vllm-project/vllm) and Hugging Face [TGI](https://github.com/huggingface/text-generation-inference), two leading open-source tools to deploy and serve LLMs, and how to create vLLM and TGI hosted Llama 3 instances with [LangChain](https://www.langchain.com/), an open-source LLM app development framework which we used for our other demo apps: [Getting to Know Llama](https://github.com/meta-llama/llama-recipes/blob/main/recipes/quickstart/Getting_to_know_Llama.ipynb), Running Llama 3 <!-- markdown-link-check-disable -->[locally](https://github.com/meta-llama/llama-recipes/blob/main/recipes/quickstart/Running_Llama3_Anywhere/Running_Llama_on_Mac_Windows_Linux.ipynb) <!-- markdown-link-check-disable --> and [in the cloud](https://github.com/meta-llama/llama-recipes/blob/main/recipes/quickstart/RAG/hello_llama_cloud.ipynb). See [here](https://medium.com/@rohit.k/tgi-vs-vllm-making-informed-choices-for-llm-deployment-37c56d7ff705) for a detailed comparison of vLLM and TGI. For [Ollama](https://ollama.com) based on-prem inference with Llama 3, see the Running Llama 3 locally notebook above. 
@@ -8,7 +8,7 @@ We'll use the Amazon EC2 instance running Ubuntu with an A10G 24GB GPU as an exa The Colab notebook to connect via LangChain with Llama 3 hosted as the vLLM and TGI API services is [here](https://colab.research.google.com/drive/1rYWLdgTGIU1yCHmRpAOB2D-84fPzmOJg), also shown in the sections below. -This tutorial assumes that you you have been granted access to the Meta Llama 3 on Hugging Face - you can open a Hugging Face Meta model page [here](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) to confirm that you see "Gated model You have been granted access to this model"; if you see "You need to agree to share your contact information to access this model", simply complete and submit the form in the page. +This tutorial assumes that you have been granted access to the Meta Llama 3 on Hugging Face - you can open a Hugging Face Meta model page [here](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) to confirm that you see "Gated model You have been granted access to this model"; if you see "You need to agree to share your contact information to access this model", simply complete and submit the form in the page. You'll also need your Hugging Face access token which you can get at your Settings page [here](https://huggingface.co/settings/tokens). @@ -108,7 +108,7 @@ On a Google Colab notebook, first install two packages: !pip install langchain openai ``` -Note that you only need to install the `openai` package with an `EMPTY` OpenAI API key to complete the LangChain integration with the OpenAI-compatible vLLM deployment of Llama 3. +Note that you only need to install the `openai` package with an `EMPTY` OpenAI API key to complete the LangChain integration with the OpenAI-compatible vLLM deployment of Llama 3. Then replace the <vllm_server_ip_address> below and run the code: @@ -165,7 +165,7 @@ curl 127.0.0.1:8080/generate_stream -X POST -H 'Content-Type: application/json' "parameters": { "max_new_tokens":200 } - }' + }' ``` and see the answer generated by Llama 3 via TGI like below: @@ -199,4 +199,3 @@ llm("What wrote the book innovators dilemma?") ``` With the Llama 3 instance `llm` created this way, you can integrate seamlessly with LangChain to build powerful on-prem Llama 3 apps. 
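As one self-contained illustration of that integration (not taken verbatim from the tutorial), the sketch below wires an on-prem TGI endpoint into a small LangChain chain; the server address, generation settings, and prompt wording are assumptions to adapt to your own deployment.

```python
# Illustrative sketch: an on-prem TGI endpoint driven through a small LangChain chain.
# The server address, generation settings, and prompt text are assumptions.
from langchain_community.llms import HuggingFaceTextGenInference
from langchain_core.prompts import PromptTemplate

llm = HuggingFaceTextGenInference(
    inference_server_url="http://<tgi_server_ip_address>:8080",  # placeholder host
    max_new_tokens=200,
    temperature=0.7,
)

prompt = PromptTemplate.from_template("Question: {question}\nAnswer:")
chain = prompt | llm  # prompt template piped into the on-prem Llama 3 endpoint

print(chain.invoke({"question": "Who wrote the book The Innovator's Dilemma?"}))
```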
- diff --git a/recipes/inference/model_servers/vllm/inference.py b/recipes/3p_integrations/vllm/inference.py similarity index 100% rename from recipes/inference/model_servers/vllm/inference.py rename to recipes/3p_integrations/vllm/inference.py diff --git a/recipes/README.md b/recipes/README.md index 000e4e39185d5dd6bb5335ccd5c41560a0f43cbd..2bcbbdcb450a38e4d99ee07579ca7c80c9981daf 100644 --- a/recipes/README.md +++ b/recipes/README.md @@ -4,8 +4,8 @@ This folder contains examples organized by topic: |---|---| [quickstart](./quickstart)|The "Hello World" of using Llama 3, start here if you are new to using Llama 3 [multilingual](./multilingual)|Scripts to add a new language to Llama -[finetuning](./finetuning)|Scripts to finetune Llama 3 on single-GPU and multi-GPU setups -[inference](./inference)|Scripts to deploy Llama 3 for inference [locally](./inference/local_inference/), on mobile [Android](./inference/mobile_inference/android_inference/) and using [model servers](./inference/mobile_inference/) +[finetuning](./quickstart/finetuning)|Scripts to finetune Llama 3 on single-GPU and multi-GPU setups +[inference](./quickstart/inference)|Scripts to deploy Llama 3 for inference [locally](./quickstart/inference/local_inference/), on mobile [Android](./quickstart/inference/mobile_inference/android_inference/) and using [model servers](./quickstart/inference/mobile_inference/) [use_cases](./use_cases)|Scripts showing common applications of Llama 3 [responsible_ai](./responsible_ai)|Scripts to use PurpleLlama for safeguarding model outputs [llama_api_providers](./llama_api_providers)|Scripts to run inference on Llama via hosted endpoints diff --git a/recipes/use_cases/RAG/HelloLlamaCloud.ipynb b/recipes/quickstart/RAG/hello_llama_cloud.ipynb similarity index 100% rename from recipes/use_cases/RAG/HelloLlamaCloud.ipynb rename to recipes/quickstart/RAG/hello_llama_cloud.ipynb diff --git a/recipes/finetuning/LLM_finetuning_overview.md b/recipes/quickstart/finetuning/LLM_finetuning_overview.md similarity index 100% rename from recipes/finetuning/LLM_finetuning_overview.md rename to recipes/quickstart/finetuning/LLM_finetuning_overview.md diff --git a/recipes/finetuning/README.md b/recipes/quickstart/finetuning/README.md similarity index 100% rename from recipes/finetuning/README.md rename to recipes/quickstart/finetuning/README.md diff --git a/recipes/finetuning/datasets/README.md b/recipes/quickstart/finetuning/datasets/README.md similarity index 94% rename from recipes/finetuning/datasets/README.md rename to recipes/quickstart/finetuning/datasets/README.md index ea2847f73bece6ab48ef048a3ec450ea4fecbff1..ae31bb81db113e01ff8b026bba47cc3c331234e3 100644 --- a/recipes/finetuning/datasets/README.md +++ b/recipes/quickstart/finetuning/datasets/README.md @@ -1,6 +1,6 @@ # Datasets and Evaluation Metrics -The provided fine tuning scripts allows you to select between three datasets by passing the `dataset` arg to the `llama_recipes.finetuning` module or [`recipes/finetuning/finetuning.py`](../finetuning.py) script. The current options are `grammar_dataset`, `alpaca_dataset`and `samsum_dataset`. 
Additionally, we integrate the OpenAssistant/oasst1 dataset as an [example for a custom dataset](custom_dataset.py) Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses) +The provided fine tuning scripts allow you to select between three datasets by passing the `dataset` arg to the `llama_recipes.finetuning` module or [`recipes/quickstart/finetuning/finetuning.py`](../finetuning.py) script. The current options are `grammar_dataset`, `alpaca_dataset` and `samsum_dataset`. Additionally, we integrate the OpenAssistant/oasst1 dataset as an [example for a custom dataset](custom_dataset.py). Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses) * [grammar_dataset](https://huggingface.co/datasets/jfleg) contains 150K pairs of english sentences and possible corrections. * [alpaca_dataset](https://github.com/tatsu-lab/stanford_alpaca) provides 52K instruction-response pairs as generated by `text-davinci-003`. diff --git a/recipes/finetuning/datasets/custom_dataset.py b/recipes/quickstart/finetuning/datasets/custom_dataset.py similarity index 100% rename from recipes/finetuning/datasets/custom_dataset.py rename to recipes/quickstart/finetuning/datasets/custom_dataset.py diff --git a/recipes/finetuning/finetuning.py b/recipes/quickstart/finetuning/finetuning.py similarity index 100% rename from recipes/finetuning/finetuning.py rename to recipes/quickstart/finetuning/finetuning.py diff --git a/recipes/finetuning/multi_node.slurm b/recipes/quickstart/finetuning/multi_node.slurm similarity index 100% rename from recipes/finetuning/multi_node.slurm rename to recipes/quickstart/finetuning/multi_node.slurm diff --git a/recipes/finetuning/multigpu_finetuning.md b/recipes/quickstart/finetuning/multigpu_finetuning.md similarity index 100% rename from recipes/finetuning/multigpu_finetuning.md rename to recipes/quickstart/finetuning/multigpu_finetuning.md diff --git a/recipes/finetuning/quickstart_peft_finetuning.ipynb b/recipes/quickstart/finetuning/quickstart_peft_finetuning.ipynb similarity index 100% rename from recipes/finetuning/quickstart_peft_finetuning.ipynb rename to recipes/quickstart/finetuning/quickstart_peft_finetuning.ipynb diff --git a/recipes/finetuning/singlegpu_finetuning.md b/recipes/quickstart/finetuning/singlegpu_finetuning.md similarity index 100% rename from recipes/finetuning/singlegpu_finetuning.md rename to recipes/quickstart/finetuning/singlegpu_finetuning.md diff --git a/recipes/code_llama/README.md b/recipes/quickstart/inference/code_llama/README.md similarity index 100% rename from recipes/code_llama/README.md rename to recipes/quickstart/inference/code_llama/README.md diff --git a/recipes/code_llama/code_completion_example.py b/recipes/quickstart/inference/code_llama/code_completion_example.py similarity index 100% rename from recipes/code_llama/code_completion_example.py rename to recipes/quickstart/inference/code_llama/code_completion_example.py diff --git a/recipes/code_llama/code_completion_prompt.txt b/recipes/quickstart/inference/code_llama/code_completion_prompt.txt similarity index 100% rename from recipes/code_llama/code_completion_prompt.txt rename to recipes/quickstart/inference/code_llama/code_completion_prompt.txt diff --git a/recipes/code_llama/code_infilling_example.py b/recipes/quickstart/inference/code_llama/code_infilling_example.py similarity index 100% rename from 
recipes/code_llama/code_infilling_example.py rename to recipes/quickstart/inference/code_llama/code_infilling_example.py diff --git a/recipes/code_llama/code_infilling_prompt.txt b/recipes/quickstart/inference/code_llama/code_infilling_prompt.txt similarity index 100% rename from recipes/code_llama/code_infilling_prompt.txt rename to recipes/quickstart/inference/code_llama/code_infilling_prompt.txt diff --git a/recipes/code_llama/code_instruct_example.py b/recipes/quickstart/inference/code_llama/code_instruct_example.py similarity index 100% rename from recipes/code_llama/code_instruct_example.py rename to recipes/quickstart/inference/code_llama/code_instruct_example.py diff --git a/recipes/inference/local_inference/README.md b/recipes/quickstart/inference/local_inference/README.md similarity index 94% rename from recipes/inference/local_inference/README.md rename to recipes/quickstart/inference/local_inference/README.md index 11c3dc5e8c1b1d63822e11e2dd6e9a2205fb5c4d..15d97008dd0f2875b9fa2226047d8460dc9d8888 100644 --- a/recipes/inference/local_inference/README.md +++ b/recipes/quickstart/inference/local_inference/README.md @@ -69,7 +69,7 @@ In case you have fine-tuned your model with pure FSDP and saved the checkpoints This is helpful if you have fine-tuned you model using FSDP only as follows: ```bash -torchrun --nnodes 1 --nproc_per_node 8 recipes/finetuning/finetuning.py --enable_fsdp --model_name /path_of_model_folder/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 +torchrun --nnodes 1 --nproc_per_node 8 recipes/quickstart/finetuning/finetuning.py --enable_fsdp --model_name /path_of_model_folder/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 ``` Then convert your FSDP checkpoint to HuggingFace checkpoints using: ```bash @@ -82,6 +82,6 @@ By default, training parameter are saved in `train_params.yaml` in the path wher Then run inference using: ```bash -python inference.py --model_name <training_config.output_dir> --prompt_file <test_prompt_file> +python inference.py --model_name <training_config.output_dir> --prompt_file <test_prompt_file> -``` \ No newline at end of file +``` diff --git a/recipes/inference/local_inference/chat_completion/chat_completion.py b/recipes/quickstart/inference/local_inference/chat_completion/chat_completion.py similarity index 100% rename from recipes/inference/local_inference/chat_completion/chat_completion.py rename to recipes/quickstart/inference/local_inference/chat_completion/chat_completion.py diff --git a/recipes/inference/local_inference/chat_completion/chats.json b/recipes/quickstart/inference/local_inference/chat_completion/chats.json similarity index 100% rename from recipes/inference/local_inference/chat_completion/chats.json rename to recipes/quickstart/inference/local_inference/chat_completion/chats.json diff --git a/recipes/inference/local_inference/inference.py b/recipes/quickstart/inference/local_inference/inference.py similarity index 100% rename from recipes/inference/local_inference/inference.py rename to recipes/quickstart/inference/local_inference/inference.py diff --git a/recipes/inference/local_inference/samsum_prompt.txt b/recipes/quickstart/inference/local_inference/samsum_prompt.txt similarity index 100% rename from recipes/inference/local_inference/samsum_prompt.txt rename to recipes/quickstart/inference/local_inference/samsum_prompt.txt diff --git a/recipes/inference/mobile_inference/android_inference/README.md 
b/recipes/quickstart/inference/mobile_inference/android_inference/README.md similarity index 100% rename from recipes/inference/mobile_inference/android_inference/README.md rename to recipes/quickstart/inference/mobile_inference/android_inference/README.md diff --git a/recipes/inference/mobile_inference/android_inference/mlc-package-config.json b/recipes/quickstart/inference/mobile_inference/android_inference/mlc-package-config.json similarity index 100% rename from recipes/inference/mobile_inference/android_inference/mlc-package-config.json rename to recipes/quickstart/inference/mobile_inference/android_inference/mlc-package-config.json diff --git a/recipes/inference/mobile_inference/android_inference/requirements.txt b/recipes/quickstart/inference/mobile_inference/android_inference/requirements.txt similarity index 100% rename from recipes/inference/mobile_inference/android_inference/requirements.txt rename to recipes/quickstart/inference/mobile_inference/android_inference/requirements.txt diff --git a/src/tests/datasets/test_custom_dataset.py b/src/tests/datasets/test_custom_dataset.py index c3fdc0f71bd81fd3066ec5c667acee62c490e139..5b8028af0bfb9344e1da13d9363db4857b80ce1e 100644 --- a/src/tests/datasets/test_custom_dataset.py +++ b/src/tests/datasets/test_custom_dataset.py @@ -51,7 +51,7 @@ def test_custom_dataset(step_lr, optimizer, get_model, tokenizer, train, mocker, kwargs = { "dataset": "custom_dataset", "model_name": llama_version, - "custom_dataset.file": "recipes/finetuning/datasets/custom_dataset.py", + "custom_dataset.file": "recipes/quickstart/finetuning/datasets/custom_dataset.py", "custom_dataset.train_split": "validation", "batch_size_training": 2, "val_batch_size": 4, @@ -108,7 +108,7 @@ def test_unknown_dataset_error(step_lr, optimizer, tokenizer, get_model, train, kwargs = { "dataset": "custom_dataset", - "custom_dataset.file": "recipes/finetuning/datasets/custom_dataset.py:get_unknown_dataset", + "custom_dataset.file": "recipes/quickstart/finetuning/datasets/custom_dataset.py:get_unknown_dataset", "batch_size_training": 1, "use_peft": False, } diff --git a/tools/benchmarks/inference/on_prem/README.md b/tools/benchmarks/inference/on_prem/README.md index 0d2053f5ac76a3d1470fee76e8084170bc9d95cb..8f0da859b7fd5c24bea2c468ebbbf6c20041223c 100644 --- a/tools/benchmarks/inference/on_prem/README.md +++ b/tools/benchmarks/inference/on_prem/README.md @@ -7,7 +7,7 @@ We support benchmark on these serving framework: # vLLM - Getting Started -To get started, we first need to deploy containers on-prem as a API host. Follow the guidance [here](../../../inference/model_servers/llama-on-prem.md#setting-up-vllm-with-llama-3) to deploy vLLM on-prem. +To get started, we first need to deploy containers on-prem as an API host. Follow the guidance [here](../../../3p_integrations/llama-on-prem.md#setting-up-vllm-with-llama-3) to deploy vLLM on-prem. Note that in common scenario which overall throughput is important, we suggest you prioritize deploying as many model replicas as possible to reach higher overall throughput and request-per-second (RPS), comparing to deploy one model container among multiple GPUs for model parallelism. Additionally, as deploying multiple model replicas, there is a need for a higher level wrapper to handle the load balancing which here has been simulated in the benchmark scripts. For example, we have an instance from Azure that has 8xA100 80G GPUs, and we want to deploy the Meta Llama 3 70B instruct model, which is around 140GB with FP16. So for deployment we can do: