[Open in Colab](https://colab.research.google.com/github/aurelio-labs/semantic-router/blob/main/docs/05-local-execution.ipynb) [Open in nbviewer](https://nbviewer.org/github/aurelio-labs/semantic-router/blob/main/docs/05-local-execution.ipynb)
## Fully local Semantic Router with `llama.cpp` and HuggingFace Encoder
There are many reasons users might choose to roll their own LLMs rather than use a third-party service. Whether it's due to cost, privacy or compliance, Semantic Router supports the use of "local" LLMs through `llama.cpp`.
Using `llama.cpp` also enables the use of quantized GGUF models, reducing the memory footprint of deployed models and allowing even 13-billion-parameter models to run with hardware acceleration on an Apple M1 Pro chip.
Below is an example of using Semantic Router with **Mistral-7B-Instruct**, quantized to GGUF.
> Note: if you require hardware acceleration via BLAS, CUDA, Metal, etc., please refer to the [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python#installation-with-specific-hardware-acceleration-blas-cuda-metal-etc) repository README.md.
The example also defines a dynamic `get_time` route; the function schema generated for it carries a description telling the LLM that the `timezone` parameter must be a valid IANA Time Zone Database name such as "America/New_York" or "Europe/London", not a plain place name like "rome" or "new york".
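As a rough sketch of where that description comes from, the function below reproduces the docstring quoted above; the `zoneinfo`-based body is purely illustrative, and the note about schema generation is an assumption about how Semantic Router derives the schema (the exact helper and import path may vary by version).

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def get_time(timezone: str) -> str:
    """Finds the current time in a specific timezone.

    :param timezone: The timezone to find the current time in, should
        be a valid timezone from the IANA Time Zone Database like
        "America/New_York" or "Europe/London". Do NOT put the place
        name itself like "rome", or "new york", you must provide
        the IANA format.
    :type timezone: str
    :return: The current time in the specified timezone.
    """
    now = datetime.now(ZoneInfo(timezone))
    return now.strftime("%H:%M")


# Semantic Router can derive the function schema (name, description,
# parameters) from this signature and docstring, producing the
# description text shown above.
```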
You can use alternative encoders; however, in this example we want to showcase a fully local Semantic Router execution, so we use a `HuggingFaceEncoder` with `sentence-transformers/all-MiniLM-L6-v2` (the default) as the embedding model.
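A minimal sketch of that encoder setup, assuming the `HuggingFaceEncoder` class exported from `semantic_router.encoders`:

```python
from semantic_router.encoders import HuggingFaceEncoder

# Defaults to sentence-transformers/all-MiniLM-L6-v2, which runs
# locally and requires no API key.
encoder = HuggingFaceEncoder()
```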
From here, we can go ahead and instantiate our `llama-cpp-python` `llama_cpp.Llama` LLM and then pass it to the `semantic_router.llms.LlamaCppLLM` wrapper class, as sketched below.
For `llama_cpp.Llama`, there are a few parameters you should pay attention to:
- `n_gpu_layers`: how many LLM layers to offload to the GPU (to offload the entire model, pass `-1`; for CPU-only execution, pass `0`)
- `n_ctx`: context size, which limits the number of tokens that can be passed to the LLM (this is bounded by the model's internal maximum context size, which for Mistral-7B-Instruct is 8000 tokens)
- `verbose`: if `False`, silences output from `llama.cpp`
> For explanations of other parameters, refer to the `llama-cpp-python` [API Reference](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/)
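A minimal sketch of that wiring, assuming a locally downloaded GGUF file (the file path, model name, and the exact `LlamaCppLLM` constructor arguments below are assumptions and may differ across semantic-router versions):

```python
from llama_cpp import Llama
from semantic_router.llms import LlamaCppLLM

# Hypothetical local path to a quantized Mistral-7B-Instruct GGUF file.
_llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_0.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU; use 0 for CPU-only execution
    n_ctx=2048,       # context window; must not exceed the model's maximum
    verbose=False,    # silence llama.cpp logging
)

llm = LlamaCppLLM(name="Mistral-7B-v0.2-Instruct", llm=_llm, max_tokens=None)
```

The wrapped `llm` can then be passed, together with the `encoder` and your routes, to Semantic Router's route layer to complete the fully local setup.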