From f70fceb8c7ff242b4f86e8e004c70e7a618ff780 Mon Sep 17 00:00:00 2001
From: sekyonda <127536312+sekyondaMeta@users.noreply.github.com>
Date: Wed, 19 Jul 2023 14:48:39 -0400
Subject: [PATCH] Moved inference.md to docs

Moved the inference.md to docs to have the docs in one folder.
Created a Readme for the inference folder.
Updated relevant docs to reflect new inference.md link.
---
 README.md                        |  4 ++--
 docs/FAQ.md                      |  8 ++++++++
 {inference => docs}/inference.md |  8 ++++----
 inference/README.md              | 10 ++++++++++
 4 files changed, 24 insertions(+), 6 deletions(-)
 rename {inference => docs}/inference.md (87%)
 create mode 100644 inference/README.md

diff --git a/README.md b/README.md
index 90800bcd..e24486a6 100644
--- a/README.md
+++ b/README.md
@@ -11,7 +11,7 @@ Llama 2 is a new technology that carries potential risks with use. Testing condu
 - [Single GPU](#single-gpu)
 - [Multi GPU One Node](#multiple-gpus-one-node)
 - [Multi GPU Multi Node](#multi-gpu-multi-node)
-3. [Inference](./inference/inference.md)
+3. [Inference](./docs/inference.md)
 4. [Model Conversion](#model-conversion-to-hugging-face)
 5. [Repository Organization](#repository-organization)
 6. [License and Acceptable Use Policy](#license)
@@ -32,7 +32,7 @@ Llama 2 is a new technology that carries potential risks with use. Testing condu
 * [Multi-GPU Fine-tuning](./docs/mutli_gpu.md)
 * [LLM Fine-tuning](./docs/LLM_finetuning.md)
 * [Adding custom datasets](./docs/Dataset.md)
-* [Inference](./inference/inference.md)
+* [Inference](./docs/inference.md)
 * [FAQs](./docs/FAQ.md)
 
 ## Requirements
diff --git a/docs/FAQ.md b/docs/FAQ.md
index 0b1c280f..cd16ef8b 100644
--- a/docs/FAQ.md
+++ b/docs/FAQ.md
@@ -17,3 +17,11 @@ Here we discuss frequently asked questions that may occur and we found useful al
 4. Can I add custom datasets?
 
     Yes, you can find more information on how to do that [here](Dataset.md).
+
+5. What are the hardware SKU requirements for deploying these models?
+
+    Hardware requirements vary based on latency, throughput and cost constraints. For good latency, the models were split across multiple GPUs with tensor parallelism in a machine with NVIDIA A100s or H100s. But TPUs, other types of GPUs, or even commodity hardware can also be used to deploy these models (e.g. https://github.com/ggerganov/llama.cpp).
+
+6. What are the hardware SKU requirements for fine-tuning Llama pre-trained models?
+
+    Fine-tuning requirements vary based on the amount of data, time to complete fine-tuning and cost constraints. To fine-tune these models we have generally used multiple NVIDIA A100 machines with data parallelism across nodes and a mix of data and tensor parallelism intra-node. But using a single machine or other GPU types is definitely possible (e.g. alpaca models are trained on a single RTX4090: https://github.com/tloen/alpaca-lora).
diff --git a/inference/inference.md b/docs/inference.md
similarity index 87%
rename from inference/inference.md
rename to docs/inference.md
index d30ca86b..144431bb 100644
--- a/inference/inference.md
+++ b/docs/inference.md
@@ -1,6 +1,6 @@
 # Inference
 
-For inference we have provided an [inference script](inference.py). Depending on the type of finetuning performed during training the [inference script](inference.py) takes different arguments.
+For inference we have provided an [inference script](../inference/inference.py). Depending on the type of finetuning performed during training the [inference script](../inference/inference.py) takes different arguments.
 To finetune all model parameters the output dir of the training has to be given as --model_name argument.
 In the case of a parameter efficient method like lora the base model has to be given as --model_name and the output dir of the training has to be given as --peft_model argument.
 Additionally, a prompt for the model in the form of a text file has to be provided. The prompt file can either be piped through standard input or given as --prompt_file parameter.
@@ -41,12 +41,12 @@ Alternate inference options include:
 
 [**vLLM**](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html): To use vLLM you will need to install it using the instructions [here](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#installation).
-Once installed, you can use the vLLM_ineference.py script provided [here](vLLM_inference.py).
+Once installed, you can use the vLLM_inference.py script provided [here](../inference/vLLM_inference.py).
 
 Below is an example of how to run the vLLM_inference.py script found within the inference folder.
 
 ``` bash
-python vLLM_inference.py --model_name <PATH/TO/LLAMA/7B>
+python vLLM_inference.py --model_name <PATH/TO/MODEL/7B>
 ```
 
-[**TGI**](https://github.com/huggingface/text-generation-inference): Text Generation Inference (TGI) is another inference option available to you. For more information on how to set up and use TGI see [here](hf-text-generation-inference/README.md).
+[**TGI**](https://github.com/huggingface/text-generation-inference): Text Generation Inference (TGI) is another inference option available to you. For more information on how to set up and use TGI see [here](../inference/hf-text-generation-inference/README.md).
 
diff --git a/inference/README.md b/inference/README.md
new file mode 100644
index 00000000..3773b883
--- /dev/null
+++ b/inference/README.md
@@ -0,0 +1,10 @@
+# Inference
+
+For inference we have provided an [inference script](inference.py). Depending on the type of finetuning performed during training the [inference script](inference.py) takes different arguments.
+To finetune all model parameters the output dir of the training has to be given as --model_name argument.
+In the case of a parameter efficient method like lora the base model has to be given as --model_name and the output dir of the training has to be given as --peft_model argument.
+Additionally, a prompt for the model in the form of a text file has to be provided. The prompt file can either be piped through standard input or given as --prompt_file parameter.
+
+For other inference options, you can use the [vLLM_inference.py](vLLM_inference.py) script for vLLM or review the [hf-text-generation-inference](hf-text-generation-inference/README.md) folder for TGI.
+
+For more information including inference safety checks, examples and other inference options available to you, see the inference documentation [here](../docs/inference.md).
--
GitLab
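
Below is a minimal sketch of the inference.py invocations that the relocated inference.md describes. The --model_name, --peft_model and --prompt_file arguments, and the option to pipe a prompt through standard input, come from the documentation text in the patch above; the model paths and the prompt file name are placeholders, not values taken from this change.

``` bash
# Full-parameter finetuning: the training output dir is passed as --model_name
# and the prompt is supplied as a text file (placeholder paths, run from the repo root).
python inference/inference.py --model_name <PATH/TO/FINETUNED/MODEL> --prompt_file prompt.txt

# Parameter-efficient finetuning (e.g. LoRA): the base model is passed as --model_name
# and the training output dir as --peft_model.
python inference/inference.py --model_name <PATH/TO/BASE/MODEL> --peft_model <PATH/TO/PEFT/OUTPUT> --prompt_file prompt.txt

# Alternatively, the prompt can be piped through standard input instead of --prompt_file.
cat prompt.txt | python inference/inference.py --model_name <PATH/TO/FINETUNED/MODEL>
```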