Commit f70fceb8 authored by sekyonda

Moved inference.md to docs

Moved inference.md to docs so that all the docs live in one folder.

Created a Readme for the inference folder

Updated relevant docs to reflect new inference.md link.
parent ed4dcafb
@@ -11,7 +11,7 @@ Llama 2 is a new technology that carries potential risks with use. Testing condu
 - [Single GPU](#single-gpu)
 - [Multi GPU One Node](#multiple-gpus-one-node)
 - [Multi GPU Multi Node](#multi-gpu-multi-node)
-3. [Inference](./inference/inference.md)
+3. [Inference](./docs/inference.md)
 4. [Model Conversion](#model-conversion-to-hugging-face)
 5. [Repository Organization](#repository-organization)
 6. [License and Acceptable Use Policy](#license)
@@ -32,7 +32,7 @@ Llama 2 is a new technology that carries potential risks with use. Testing condu
 * [Multi-GPU Fine-tuning](./docs/mutli_gpu.md)
 * [LLM Fine-tuning](./docs/LLM_finetuning.md)
 * [Adding custom datasets](./docs/Dataset.md)
-* [Inference](./inference/inference.md)
+* [Inference](./docs/inference.md)
 * [FAQs](./docs/FAQ.md)
 ## Requirements
@@ -17,3 +17,11 @@ Here we discuss frequently asked questions that may occur and we found useful al
 4. Can I add custom datasets?
 Yes, you can find more information on how to do that [here](Dataset.md).
+5. What are the hardware SKU requirements for deploying these models?
+Hardware requirements vary based on latency, throughput and cost constraints. For good latency, the models were split across multiple GPUs with tensor parallelism in a machine with NVIDIA A100s or H100s. But TPUs, other types of GPUs, or even commodity hardware can also be used to deploy these models (e.g. https://github.com/ggerganov/llama.cpp).
+6. What are the hardware SKU requirements for fine-tuning Llama pre-trained models?
+Fine-tuning requirements vary based on the amount of data, the time to complete fine-tuning, and cost constraints. To fine-tune these models we have generally used multiple NVIDIA A100 machines with data parallelism across nodes and a mix of data and tensor parallelism within each node. But using a single machine or other GPU types is definitely possible (e.g. alpaca models are trained on a single RTX 4090: https://github.com/tloen/alpaca-lora).
 # Inference
-For inference we have provided an [inference script](inference.py). Depending on the type of finetuning performed during training, the [inference script](inference.py) takes different arguments.
+For inference we have provided an [inference script](../inference/inference.py). Depending on the type of finetuning performed during training, the [inference script](../inference/inference.py) takes different arguments.
 If all model parameters were fine-tuned, the output dir of the training has to be given as the --model_name argument.
 In the case of a parameter-efficient method like LoRA, the base model has to be given as --model_name and the output dir of the training has to be given as the --peft_model argument.
 Additionally, a prompt for the model in the form of a text file has to be provided. The prompt file can either be piped through standard input or given as the --prompt_file parameter.
@@ -41,12 +41,12 @@ Alternate inference options include:
 [**vLLM**](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html):
 To use vLLM you will need to install it using the instructions [here](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#installation).
-Once installed, you can use the vLLM_ineference.py script provided [here](vLLM_inference.py).
+Once installed, you can use the vLLM_inference.py script provided [here](../inference/vLLM_inference.py).
 Below is an example of how to run the vLLM_inference.py script found within the inference folder.
 ``` bash
-python vLLM_inference.py --model_name <PATH/TO/LLAMA/7B>
+python vLLM_inference.py --model_name <PATH/TO/MODEL/7B>
 ```
-[**TGI**](https://github.com/huggingface/text-generation-inference): Text Generation Inference (TGI) is another inference option available to you. For more information on how to set up and use TGI see [here](hf-text-generation-inference/README.md).
+[**TGI**](https://github.com/huggingface/text-generation-inference): Text Generation Inference (TGI) is another inference option available to you. For more information on how to set up and use TGI, see [here](../inference/hf-text-generation-inference/README.md).
# Inference
For inference we have provided an [inference script](inference.py). Depending on the type of finetuning performed during training, the [inference script](inference.py) takes different arguments.
If all model parameters were fine-tuned, the output dir of the training has to be given as the --model_name argument.
In the case of a parameter-efficient method like LoRA, the base model has to be given as --model_name and the output dir of the training has to be given as the --peft_model argument.
Additionally, a prompt for the model in the form of a text file has to be provided. The prompt file can either be piped through standard input or given as the --prompt_file parameter.
For other inference options, you can use the [vLLM_inference.py](vLLM_inference.py) script for vLLM or review the [hf-text-generation-inference](hf-text-generation-inference/README.md) folder for TGI.
For more information, including inference safety checks, examples, and other inference options available to you, see the inference documentation [here](../docs/inference.md).
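As a minimal sketch of the two invocation modes described above, assuming the repository layout after this commit (inference.py still lives in the inference folder) and using placeholder checkpoint paths and prompt file names:

``` bash
# Full-parameter fine-tuning: pass the training output dir as --model_name.
python inference/inference.py --model_name <PATH/TO/TRAINING/OUTPUT> --prompt_file prompt.txt

# Parameter-efficient fine-tuning (e.g. LoRA): pass the base model as --model_name
# and the training output dir as --peft_model.
python inference/inference.py --model_name <PATH/TO/BASE/MODEL/7B> --peft_model <PATH/TO/PEFT/OUTPUT> --prompt_file prompt.txt

# The prompt can also be piped through standard input instead of using --prompt_file.
cat prompt.txt | python inference/inference.py --model_name <PATH/TO/BASE/MODEL/7B> --peft_model <PATH/TO/PEFT/OUTPUT>
```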