%% Cell type:markdown id:3ac9adb4 tags:
<a href="https://colab.research.google.com/github/jerryjliu/llama_index/blob/main/docs/examples/llm/llama_2_llama_cpp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
%% Cell type:markdown id:368686b4-f487-4dd4-aeff-37823976529d tags:
# LlamaCPP
In this short notebook, we show how to use the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) library with LlamaIndex.
In this notebook, we use the [`llama-2-chat-13b-ggml`](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML) model, along with the proper prompt formatting.
Note that if you're using a version of `llama-cpp-python` after version `0.1.79`, the model format has changed from `ggmlv3` to `gguf`. Old model files like the one used in this notebook can be converted using scripts in the [`llama.cpp`](https://github.com/ggerganov/llama.cpp) repo. Alternatively, you can download the GGUF version of the model above from [huggingface](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF).
By default, if `model_path` and `model_url` are blank, the `LlamaCPP` module will load llama2-chat-13B in either format, depending on your version.
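If you do need to convert an older GGML checkpoint yourself, the invocation typically looks something like the sketch below. The script name and flags shown here are assumptions based on the `llama.cpp` repo around the time of the `gguf` transition; they have been renamed and reorganized across versions, so check the repo for the current equivalent before running.
``` python
# Hypothetical example only -- the conversion script's name and flags vary by llama.cpp version.
# Verify against the llama.cpp repo before running.
!python convert-llama-ggmlv3-to-gguf.py --input llama-2-13b-chat.ggmlv3.q4_0.bin --output llama-2-13b-chat.q4_0.gguf
```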
## Installation
To get the best performance out of `LlamaCPP`, it is recommended to install the package so that it is compiled with GPU support. A full guide for installing this way is [here](https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast--metal).
Full macOS instructions are also [here](https://llama-cpp-python.readthedocs.io/en/latest/install/macos/).
In general:
- Use `CuBLAS` if you have CUDA and an NVIDIA GPU
- Use `METAL` if you are running on an M1/M2 MacBook
- Use `CLBLAST` if you are running on an AMD/Intel GPU
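As a rough sketch (the exact CMake flags change between `llama-cpp-python` releases, so defer to the guides linked above), a GPU-enabled install from a notebook typically looks like:
``` python
# Example only -- flag names vary between llama-cpp-python releases; see the install guides above.
# Metal (Apple Silicon):
!CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --no-cache-dir llama-cpp-python
# cuBLAS (NVIDIA):
# !CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --no-cache-dir llama-cpp-python
# CLBlast (AMD/Intel):
# !CMAKE_ARGS="-DLLAMA_CLBLAST=on" FORCE_CMAKE=1 pip install --no-cache-dir llama-cpp-python
```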
%% Cell type:code id:aff273be tags:
``` python
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-llama-cpp
```
%% Cell type:code id:40a33749 tags:
``` python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)
```
%% Cell type:markdown id:e7927630-0044-41fb-a8a6-8dc3d2adb608 tags:
## Setup LLM
The `LlamaCPP` LLM is highly configurable. Depending on the model being used, you'll want to pass in `messages_to_prompt` and `completion_to_prompt` functions to help format the model inputs.
Since the default model is llama2-chat, we use the util functions found in `llama_index.llms.llama_cpp.llama_utils`.
For any kwargs that need to be passed in during initialization, set them in `model_kwargs`. A full list of available model kwargs is available in the [LlamaCPP docs](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.llama.Llama.__init__).
For any kwargs that need to be passed in during inference, you can set them in `generate_kwargs`. See the full list of [generate kwargs here](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.llama.Llama.__call__).
In general, the defaults are a great starting point. The example below shows configuration with all defaults.
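If you do want to deviate from the defaults, a sketch of passing non-default kwargs might look like the following (the values are illustrative only, the exact set of accepted kwargs depends on your `llama-cpp-python` version, and it reuses the `model_url` defined a few cells below):
``` python
# Illustrative values only -- tune for your hardware and model; see the API links above.
llm = LlamaCPP(
    model_url=model_url,  # defined a couple of cells below
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    # sampling kwargs forwarded to Llama.__call__()
    generate_kwargs={"top_k": 40, "top_p": 0.95, "repeat_penalty": 1.1},
    # kwargs forwarded to Llama.__init__(); offload more layers for more GPU use
    model_kwargs={"n_gpu_layers": 32, "n_batch": 512},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
)
```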
As noted above, we're using the [`llama-2-chat-13b-ggml`](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML) model in this notebook which uses the `ggmlv3` model format. If you are running a version of `llama-cpp-python` greater than `0.1.79`, you can replace the `model_url` below with `"https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf"`.
%% Cell type:markdown id:59b27895 tags:
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%% Cell type:code id:439960c5 tags:
``` python
!pip install llama-index
```
%% Cell type:code id:2640c7a4 tags:
``` python
model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin"
```
%% Cell type:code id:6fa0ec4f-03ff-4e28-957f-b4b99a0faa20 tags:
``` python
llm = LlamaCPP(
    # You can pass in the URL to a GGML model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": 1},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
```
%% Output
llama.cpp: loading model from /Users/rchan/Library/Caches/llama_index/models/llama-2-13b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 3900
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: mem required = 6983.72 MB (+ 3046.88 MB per state)
llama_new_context_with_model: kv self size = 3046.88 MB
ggml_metal_init: allocating
ggml_metal_init: loading '/Users/rchan/opt/miniconda3/envs/llama-index/lib/python3.10/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x14ff4f060
ggml_metal_init: loaded kernel_add_row 0x14ff4f2c0
ggml_metal_init: loaded kernel_mul 0x14ff4f520
ggml_metal_init: loaded kernel_mul_row 0x14ff4f780
ggml_metal_init: loaded kernel_scale 0x14ff4f9e0
ggml_metal_init: loaded kernel_silu 0x14ff4fc40
ggml_metal_init: loaded kernel_relu 0x14ff4fea0
ggml_metal_init: loaded kernel_gelu 0x11f7aef50
ggml_metal_init: loaded kernel_soft_max 0x11f7af380
ggml_metal_init: loaded kernel_diag_mask_inf 0x11f7af5e0
ggml_metal_init: loaded kernel_get_rows_f16 0x11f7af840
ggml_metal_init: loaded kernel_get_rows_q4_0 0x11f7afaa0
ggml_metal_init: loaded kernel_get_rows_q4_1 0x13ffba0c0
ggml_metal_init: loaded kernel_get_rows_q2_K 0x13ffba320
ggml_metal_init: loaded kernel_get_rows_q3_K 0x13ffba580
ggml_metal_init: loaded kernel_get_rows_q4_K 0x13ffbaab0
ggml_metal_init: loaded kernel_get_rows_q5_K 0x13ffbaea0
ggml_metal_init: loaded kernel_get_rows_q6_K 0x13ffbb290
ggml_metal_init: loaded kernel_rms_norm 0x13ffbb690
ggml_metal_init: loaded kernel_norm 0x13ffbba80
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x13ffbc070
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x13ffbc510
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x11f7aff40
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x11f7b03e0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x11f7b0880
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x11f7b0d20
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x11f7b11c0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x11f7b1860
ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x11f7b1d40
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32 0x11f7b2220
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32 0x11f7b2700
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32 0x11f7b2be0
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32 0x11f7b30c0
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32 0x11f7b35a0
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32 0x11f7b3a80
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32 0x11f7b3f60
ggml_metal_init: loaded kernel_rope 0x11f7b41c0
ggml_metal_init: loaded kernel_alibi_f32 0x11f7b47c0
ggml_metal_init: loaded kernel_cpy_f32_f16 0x11f7b4d90
ggml_metal_init: loaded kernel_cpy_f32_f32 0x11f7b5360
ggml_metal_init: loaded kernel_cpy_f16_f16 0x11f7b5930
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 356.03 MB
llama_new_context_with_model: max tensor size = 87.89 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB, ( 6984.50 / 21845.34)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1.36 MB, ( 6985.86 / 21845.34)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 3048.88 MB, (10034.73 / 21845.34)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 354.70 MB, (10389.44 / 21845.34)
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
%% Cell type:markdown id:445453b1 tags:
We can tell from the logging that the model is using Metal!
%% Cell type:markdown id:5e2e6a78-7e5d-4915-bcbf-6087edb30276 tags:
## Start using our `LlamaCPP` LLM abstraction!
We can simply use the `complete` method of our `LlamaCPP` LLM abstraction to generate completions given a prompt.
%% Cell type:code id:5cfaf34c-0348-415e-98bb-83f782d64fe9 tags:
``` python
response = llm.complete("Hello! Can you tell me a poem about cats and dogs?")
print(response.text)
```
%% Output
Of course, I'd be happy to help! Here's a short poem about cats and dogs:
Cats and dogs, so different yet the same,
Both furry friends, with their own special game.
Cats purr and curl up tight,
Dogs wag their tails with delight.
Cats hunt mice with stealthy grace,
Dogs chase after balls with joyful pace.
But despite their differences, they share,
A love for play and a love so fair.
So here's to our feline and canine friends,
Both equally dear, and both equally grand.
llama_print_timings: load time = 1204.19 ms
llama_print_timings: sample time = 106.79 ms / 146 runs ( 0.73 ms per token, 1367.14 tokens per second)
llama_print_timings: prompt eval time = 1204.14 ms / 81 tokens ( 14.87 ms per token, 67.27 tokens per second)
llama_print_timings: eval time = 7468.88 ms / 145 runs ( 51.51 ms per token, 19.41 tokens per second)
llama_print_timings: total time = 8993.90 ms
%% Cell type:markdown id:9038f7d7 tags:
We can use the `stream_complete` endpoint to stream the response as it’s being generated rather than waiting for the entire response to be generated.
%% Cell type:code id:7b059409-cd9d-4651-979c-03b3943e94af tags:
``` python
response_iter = llm.stream_complete("Can you write me a poem about fast cars?")
for response in response_iter:
    print(response.delta, end="", flush=True)
```
%% Output
Llama.generate: prefix-match hit
Sure! Here's a poem about fast cars:
Fast cars, sleek and strong
Racing down the highway all day long
Their engines purring smooth and sweet
As they speed through the streets
Their wheels grip the road with might
As they take off like a shot in flight
The wind rushes past with a roar
As they leave all else behind
With paint that shines like the sun
And lines that curve like a dream
They're a sight to behold, my son
These fast cars, so sleek and serene
So if you ever see one pass
Don't be afraid to give a cheer
For these machines of speed and grace
Are truly something to admire and revere.
llama_print_timings: load time = 1204.19 ms
llama_print_timings: sample time = 123.72 ms / 169 runs ( 0.73 ms per token, 1365.97 tokens per second)
llama_print_timings: prompt eval time = 267.03 ms / 14 tokens ( 19.07 ms per token, 52.43 tokens per second)
llama_print_timings: eval time = 8794.21 ms / 168 runs ( 52.35 ms per token, 19.10 tokens per second)
llama_print_timings: total time = 9485.38 ms
%% Cell type:markdown id:f7617600 tags:
## Query engine setup with LlamaCPP
We can simply pass in the `LlamaCPP` LLM abstraction to the `LlamaIndex` query engine as usual.
But first, let's change the global tokenizer to match our LLM.
%% Cell type:code id:d8ff0c0b tags:
``` python
from llama_index.core import set_global_tokenizer
from transformers import AutoTokenizer
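# LlamaIndex uses the global tokenizer for token counting (e.g., when fitting text into the
# context window), so we match it to the Llama-2 tokenizer our LLM uses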
set_global_tokenizer(
    AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf").encode
)
```
%% Cell type:code id:d4c6f564 tags:
``` python
# use Huggingface embeddings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
```
%% Cell type:code id:5d485f1e tags:
``` python
# load documents
documents = SimpleDirectoryReader(
    "../../../examples/paul_graham_essay/data"
).load_data()
```
%% Cell type:code id:c55c33cd tags:
``` python
# create vector store index
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
```
%% Cell type:code id:e07659c8 tags:
``` python
# set up query engine
query_engine = index.as_query_engine(llm=llm)
```
%% Cell type:code id:64e095c5 tags:
``` python
response = query_engine.query("What did the author do growing up?")
print(response)
```
%% Output
Llama.generate: prefix-match hit
Based on the given context information, the author's childhood activities were writing short stories and programming. They wrote programs on punch cards using an early version of Fortran and later used a TRS-80 microcomputer to write simple games, a program to predict the height of model rockets, and a word processor that their father used to write at least one book.
llama_print_timings: load time = 1204.19 ms
llama_print_timings: sample time = 56.13 ms / 80 runs ( 0.70 ms per token, 1425.21 tokens per second)
llama_print_timings: prompt eval time = 65280.71 ms / 2272 tokens ( 28.73 ms per token, 34.80 tokens per second)
llama_print_timings: eval time = 6877.38 ms / 79 runs ( 87.06 ms per token, 11.49 tokens per second)
llama_print_timings: total time = 72315.85 ms