<a href="https://colab.research.google.com/github/jerryjliu/llama_index/blob/main/docs/examples/llm/nvidia_tensorrt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
%% Cell type:markdown id: tags:
# Nvidia TensorRT-LLM
%% Cell type:markdown id: tags:
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
Since TensorRT-LLM is an SDK for interacting with local models in process, there are a few environment setup steps that must be followed before TensorRT-LLM can be used:
1. NVIDIA CUDA 12.2 or higher is currently required to run TensorRT-LLM
2. Install `tensorrt_llm` via pip with `pip3 install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com`
3. For this example we will use Llama2. The Llama2 model files need to be created via scripts following the instructions [here](https://github.com/NVIDIA/trt-llm-rag-windows/blob/release/1.0/README.md#building-trt-engine)
 * The following files will be created from following the step above
 * `Llama_float16_tp1_rank0.engine`: The main output of the build script, containing the executable graph of operations with the model weights embedded.
 * `config.json`: Includes detailed information about the model, like its general structure and precision, as well as information about which plug-ins were incorporated into the engine.
 * `model.cache`: Caches some of the timing and optimization information from model compilation, making successive builds quicker.
4. `mkdir model`
5. Move all of the files mentioned above into the `model` directory.
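%% Cell type:markdown id: tags:
With the engine files in place, the model can be loaded through the llama-index TensorRT-LLM integration. The cell below is a minimal sketch: it assumes the `LocalTensorRTLLM` class and the `model_path`, `engine_name`, `tokenizer_dir`, and `completion_to_prompt` parameters are available in your installed llama-index version, and that the `meta-llama/Llama-2-13b-chat` tokenizer matches the engine you built; adjust names and paths to your environment.
%% Cell type:code id: tags:
```python
from llama_index.llms import LocalTensorRTLLM


def completion_to_prompt(completion: str) -> str:
    # Wrap a plain prompt in the Llama2 instruction template so the
    # chat-tuned engine responds as expected.
    return f"<s> [INST] {completion} [/INST] "


# The paths and tokenizer id below are placeholders: point model_path at the
# `model` directory created above and tokenizer_dir at the tokenizer that
# corresponds to the engine you built.
llm = LocalTensorRTLLM(
    model_path="./model",
    engine_name="Llama_float16_tp1_rank0.engine",
    tokenizer_dir="meta-llama/Llama-2-13b-chat",
    completion_to_prompt=completion_to_prompt,
)

resp = llm.complete("Who is Paul Graham?")
print(str(resp))
```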