<ahref="https://colab.research.google.com/github/jerryjliu/llama_index/blob/main/docs/examples/llm/llama_2_llama_cpp.ipynb"target="_parent"><imgsrc="https://colab.research.google.com/assets/colab-badge.svg"alt="Open In Colab"/></a>
In this short notebook, we show how to use the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) library with LlamaIndex.
In this notebook, we use the [`llama-2-chat-13b-ggml`](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML) model, along with the proper prompt formatting.
Note that if you're using a version of `llama-cpp-python` after version `0.1.79`, the model format has changed from `ggmlv3` to `gguf`. Old model files like the one used in this notebook can be converted using scripts in the [`llama.cpp`](https://github.com/ggerganov/llama.cpp) repo. Alternatively, you can download the GGUF version of the model above from [huggingface](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF).
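If you'd rather skip the conversion step, the snippet below is a minimal sketch for fetching the pre-converted GGUF file with `huggingface_hub` (an extra dependency, installable with `pip install huggingface_hub`); the repo and filename match the link above.

```
from huggingface_hub import hf_hub_download

# Download the pre-converted GGUF model referenced above and return its local cache path.
# You can then point LlamaCPP at this file via `model_path`.
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-13B-chat-GGUF",
    filename="llama-2-13b-chat.Q4_0.gguf",
)
print(model_path)
```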
By default, if `model_path` and `model_url` are blank, the `LlamaCPP` module will load llama2-chat-13B in whichever format (GGML or GGUF) matches your installed version of `llama-cpp-python`.
## Installation
To get the best performance out of `LlamaCPP`, it is recommended to install the package so that it is compiled with GPU support. A full guide for installing this way is [here](https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast--metal), and example commands for each backend are sketched after the list below.
Full macOS instructions are also available [here](https://llama-cpp-python.readthedocs.io/en/latest/install/macos/).
In general:
- Use `CuBLAS` if you have CUDA and an NVIDIA GPU
- Use `METAL` if you are running on an M1/M2 MacBook
- Use `CLBLAST` if you are running on an AMD/Intel GPU
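The exact commands depend on your platform and `llama-cpp-python` version, so treat the following as a sketch and defer to the linked guides for the authoritative instructions:

```
# Hedged examples -- check the llama-cpp-python install guide for your platform and version.

# NVIDIA GPU with CUDA (cuBLAS backend)
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

# Apple Silicon M1/M2 (Metal backend)
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python

# AMD/Intel GPU (CLBlast backend)
CMAKE_ARGS="-DLLAMA_CLBLAST=on" FORCE_CMAKE=1 pip install llama-cpp-python
```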
The `LlamaCPP` LLM is highly configurable. Depending on the model being used, you'll want to pass in `messages_to_prompt` and `completion_to_prompt` functions to help format the model inputs.
Since the default model is llama2-chat, we use the util functions found in [`llama_index.llms.llama_utils`](https://github.com/jerryjliu/llama_index/blob/main/llama_index/llms/llama_utils.py).
For any kwargs that need to be passed in during initialization, set them in `model_kwargs`. A full list of available model kwargs is available in the [LlamaCPP docs](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.llama.Llama.__init__).
For any kwargs that need to be passed in during inference, you can set them in `generate_kwargs`. See the full list of [generate kwargs here](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.llama.Llama.__call__).
In general, the defaults are a great starting point. The example below shows configuration with all defaults.
As noted above, we're using the [`llama-2-chat-13b-ggml`](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML) model in this notebook, which uses the `ggmlv3` model format. If you are running a version of `llama-cpp-python` greater than `0.1.79`, you can replace the `model_url` below with `"https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf"`.
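Putting the pieces together, a configuration cell might look like the sketch below. The keyword values (e.g. `n_gpu_layers=1`, `context_window=3900`) are illustrative assumptions to adjust for your hardware and library versions; the GGUF URL from the note above is used here, so swap in the GGML file if you're on an older `llama-cpp-python`.

```
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

llm = LlamaCPP(
    # download the model from this URL (or set model_path to a local file instead)
    model_url="https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf",
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # llama2 supports a 4096-token context window; leave some headroom
    context_window=3900,
    # kwargs passed at inference time (see the generate kwargs link above)
    generate_kwargs={},
    # kwargs passed at initialization; n_gpu_layers > 0 offloads layers to the GPU
    # if llama-cpp-python was compiled with GPU support
    model_kwargs={"n_gpu_layers": 1},
    # format inputs using the llama2-chat prompt template
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
```

Once constructed, you can sanity-check the model directly with `llm.complete("Hello, world!")` before wiring it into an index.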
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
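A typical Colab install cell looks like this (the 0.x package name is assumed; pin versions as needed):

```
# install LlamaIndex (and llama-cpp-python, if you haven't already via the GPU instructions above)
!pip install llama-index llama-cpp-python
```

The query cell below also assumes a `query_engine` built earlier in the notebook. As a rough sketch (the data path is illustrative, and the `ServiceContext`-based API is tied to the 0.x `llama_index` releases), that setup might look like:

```
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext

# use the LlamaCPP llm configured above, plus a local embedding model
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

# load documents (path is illustrative) and build a vector index over them
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# create a query engine that answers questions with the index + LlamaCPP
query_engine = index.as_query_engine()
```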
```
response = query_engine.query("What did the author do growing up?")
print(response)
```
%% Output

```
Llama.generate: prefix-match hit
```

Based on the given context information, the author's childhood activities were writing short stories and programming. They wrote programs on punch cards using an early version of Fortran and later used a TRS-80 microcomputer to write simple games, a program to predict the height of model rockets, and a word processor that their father used to write at least one book.

```
llama_print_timings:        load time =  1204.19 ms
llama_print_timings:      sample time =    56.13 ms /    80 runs   (    0.70 ms per token,  1425.21 tokens per second)
llama_print_timings: prompt eval time = 65280.71 ms /  2272 tokens (   28.73 ms per token,    34.80 tokens per second)
llama_print_timings:        eval time =  6877.38 ms /    79 runs   (   87.06 ms per token,    11.49 tokens per second)
```