%% Cell type:markdown id: tags:
<a href="https://colab.research.google.com/github/jerryjliu/llama_index/blob/main/docs/examples/llm/nvidia_tensorrt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
%% Cell type:markdown id: tags:
# Nvidia TensorRT-LLM
%% Cell type:markdown id: tags:
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
[TensorRT-LLM Github](https://github.com/NVIDIA/TensorRT-LLM)
%% Cell type:markdown id: tags:
## TensorRT-LLM Environment Setup
Since TensorRT-LLM is an SDK for running local models in-process, a few environment setup steps must be completed before TensorRT-LLM can be used.
1. NVIDIA CUDA 12.2 or higher is currently required to run TensorRT-LLM.
2. Install `tensorrt_llm` via pip with `pip3 install tensorrt_llm==0.7.0 --extra-index-url https://pypi.nvidia.com --extra-index-url https://download.pytorch.org/whl/cu121` (matching the pinned version in the install cell below).
3. For this example, we will use Llama2. The Llama2 model files need to be created via scripts following the instructions [here](https://github.com/NVIDIA/trt-llm-rag-windows/blob/release/1.0/README.md#building-trt-engine).
* The following files will be created by following the step above:
* `Llama_float16_tp1_rank0.engine`: The main output of the build script, containing the executable graph of operations with the model weights embedded.
* `config.json`: Includes detailed information about the model, like its general structure and precision, as well as information about which plug-ins were incorporated into the engine.
* `model.cache`: Caches some of the timing and optimization information from model compilation, making successive builds quicker.
4. `mkdir model`
5. Move all of the files mentioned above to the `model` directory (a quick sanity check for this setup is sketched after the install cells below).
%% Cell type:code id: tags:
``` python
%pip install llama-index-llms-nvidia-tensorrt
```
%% Cell type:code id: tags:
``` python
!pip install tensorrt_llm==0.7.0 --extra-index-url https://pypi.nvidia.com --extra-index-url https://download.pytorch.org/whl/cu121
```
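%% Cell type:markdown id: tags:
Before moving on, it can help to confirm that the prerequisites above are in place. The cell below is a minimal sanity-check sketch, not part of the official example: it assumes the engine, `config.json`, and `model.cache` files from step 3 were copied into `./model`, and reuses the engine file name from the usage example below.
%% Cell type:code id: tags:
``` python
# Minimal sanity check for the setup above (illustrative sketch only).
# Assumes the build artifacts from step 3 were moved into ./model.
import os

import torch  # installed alongside tensorrt_llm

# TensorRT-LLM currently requires CUDA 12.2 or higher
print("CUDA available:", torch.cuda.is_available())
print("CUDA version (torch build):", torch.version.cuda)

expected_files = [
    "llama_float16_tp1_rank0.engine",
    "config.json",
    "model.cache",
]
for name in expected_files:
    path = os.path.join("./model", name)
    print(f"{path}: {'found' if os.path.exists(path) else 'MISSING'}")
```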
%% Cell type:markdown id: tags:
## Basic Usage
%% Cell type:markdown id: tags:
#### Call `complete` with a prompt
%% Cell type:code id: tags:
``` python
from llama_index.llms.nvidia_tensorrt import LocalTensorRTLLM


def completion_to_prompt(completion: str) -> str:
    """
    Given a completion, return the prompt using llama2 format.
    """
    return f"<s> [INST] {completion} [/INST] "


llm = LocalTensorRTLLM(
    model_path="./model",
    engine_name="llama_float16_tp1_rank0.engine",
    tokenizer_dir="meta-llama/Llama-2-13b-chat",
    completion_to_prompt=completion_to_prompt,
)
```
%% Cell type:code id: tags:
``` python
resp = llm.complete("Who is Paul Graham?")
print(str(resp))
```
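%% Cell type:markdown id: tags:
The `completion_to_prompt` helper above handles a single-turn completion. As an illustrative sketch (not part of the official example), the Llama 2 chat format can also carry a system prompt inside `<<SYS>>` tags; a helper like the one below could be passed as `completion_to_prompt` instead, since its extra argument has a default value.
%% Cell type:code id: tags:
``` python
# Illustrative sketch: Llama 2 prompt format with an optional system prompt.
# The <<SYS>> block and [INST] tags follow the Llama 2 chat template.
def completion_to_prompt_with_system(
    completion: str,
    system_prompt: str = "You are a helpful, concise assistant.",
) -> str:
    """Wrap a completion in the Llama 2 chat format with a system prompt."""
    return (
        f"<s> [INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{completion} [/INST] "
    )


print(completion_to_prompt_with_system("Who is Paul Graham?"))
```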