{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "e92c26d9",
      "metadata": {
        "id": "e92c26d9"
      },
      "source": [
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/aurelio-labs/semantic-router/blob/main/docs/05-local-execution.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/aurelio-labs/semantic-router/blob/main/docs/05-local-execution.ipynb)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "ee50410e-3f98-4d9c-8838-b38aebd6ce77",
      "metadata": {
        "id": "ee50410e-3f98-4d9c-8838-b38aebd6ce77"
      },
      "source": [
        "# Local Dynamic Routes\n",
        "\n",
        "## Fully local Semantic Router with `llama.cpp` and HuggingFace Encoder\n",
        "\n",
        "There are many reasons users might choose to roll their own LLMs rather than use a third-party service. Whether it's due to cost, privacy or compliance, Semantic Router supports the use of \"local\" LLMs through `llama.cpp`.\n",
        "\n",
        "Using `llama.cpp` also enables the use of quantized GGUF models, reducing the memory footprint of deployed models, allowing even 13-billion parameter models to run with hardware acceleration on an Apple M1 Pro chip.\n",
        "\n",
        "Below is an example of using semantic router with **Mistral-7B-Instruct**, quantized i."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "baa8d577-9f23-4dec-b167-fdecfb313c52",
      "metadata": {
        "id": "baa8d577-9f23-4dec-b167-fdecfb313c52"
      },
      "source": [
        "## Installing the library\n",
        "\n",
        "> Note: if you require hardware acceleration via BLAS, CUDA, Metal, etc. please refer to the [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python#installation-with-specific-hardware-acceleration-blas-cuda-metal-etc) repository README.md"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "f95e4906-c3e6-4905-8f13-5e67d67069d5",
      "metadata": {
        "id": "f95e4906-c3e6-4905-8f13-5e67d67069d5"
      },
      "outputs": [],
      "source": [
        "!pip install -qU \"semantic-router[local]\""
      ]
    },
    {
      "cell_type": "markdown",
      "id": "0029cc6d",
      "metadata": {
        "id": "0029cc6d"
      },
      "source": [
        "If you're running on Apple silicon you can run the following to run with Metal hardware acceleration:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "4f9b5729",
      "metadata": {
        "id": "4f9b5729",
        "outputId": "41dbb62c-7566-4b4a-95dc-7c45a003251d"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "'CMAKE_ARGS' is not recognized as an internal or external command,\n",
            "operable program or batch file.\n"
          ]
        }
      ],
      "source": [
        "!CMAKE_ARGS=\"-DLLAMA_METAL=on\""
      ]
    },
    {
      "cell_type": "markdown",
      "id": "d2f52f11-ae6d-4706-8da3-ce03a7a6b92d",
      "metadata": {
        "id": "d2f52f11-ae6d-4706-8da3-ce03a7a6b92d"
      },
      "source": [
        "## Download the Mistral 7B Instruct 4-bit GGUF files\n",
        "\n",
        "We will be using Mistral 7B Instruct, quantized as a 4-bit GGUF file, a good balance between performance and ability to deploy on consumer hardware"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "1d6ddf61-c189-4b3b-99df-9508f830ae1f",
      "metadata": {
        "id": "1d6ddf61-c189-4b3b-99df-9508f830ae1f",
        "outputId": "85368c66-1b73-49f8-99ce-85a9165dcf67"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
            "                                 Dload  Upload   Total   Spent    Left  Speed\n",
            "\n",
            "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\n",
            "100  1168  100  1168    0     0   4171      0 --:--:-- --:--:-- --:--:--  4186\n",
            "\n",
            "  0 3918M    0 6382k    0     0  7990k      0  0:08:22 --:--:--  0:08:22 7990k\n",
            "  0 3918M    0 35.0M    0     0  19.4M      0  0:03:21  0:00:01  0:03:20 28.7M\n",
            "  2 3918M    2 78.7M    0     0  28.1M      0  0:02:19  0:00:02  0:02:17 36.2M\n",
            "  3 3918M    3  123M    0     0  32.4M      0  0:02:00  0:00:03  0:01:57 39.0M\n",
            "  4 3918M    4  171M    0     0  35.6M      0  0:01:49  0:00:04  0:01:45 41.2M\n",
            "  5 3918M    5  219M    0     0  37.7M      0  0:01:43  0:00:05  0:01:38 42.5M\n",
            "  6 3918M    6  271M    0     0  39.9M      0  0:01:38  0:00:06  0:01:32 47.3M\n",
            "  8 3918M    8  326M    0     0  41.8M      0  0:01:33  0:00:07  0:01:26 49.4M\n",
            "  9 3918M    9  379M    0     0  42.5M      0  0:01:31  0:00:08  0:01:23 50.1M\n",
            " 10 3918M   10  428M    0     0  43.7M      0  0:01:29  0:00:09  0:01:20 51.5M\n",
            " 12 3918M   12  490M    0     0  45.3M      0  0:01:26  0:00:10  0:01:16 54.2M\n",
            " 13 3918M   13  543M    0     0  46.0M      0  0:01:25  0:00:11  0:01:14 54.4M\n",
            " 15 3918M   15  609M    0     0  47.6M      0  0:01:22  0:00:12  0:01:10 56.6M\n",
            " 17 3918M   17  672M    0     0  48.7M      0  0:01:20  0:00:13  0:01:07 59.8M\n",
            " 18 3918M   18  733M    0     0  49.5M      0  0:01:19  0:00:14  0:01:05 60.8M\n",
            " 20 3918M   20  800M    0     0  50.6M      0  0:01:17  0:00:15  0:01:02 62.0M\n",
            " 21 3918M   21  861M    0     0  51.2M      0  0:01:16  0:00:16  0:01:00 63.4M\n",
            " 23 3918M   23  923M    0     0  51.8M      0  0:01:15  0:00:17  0:00:58 62.7M\n",
            " 24 3918M   24  973M    0     0  51.7M      0  0:01:15  0:00:18  0:00:57 60.1M\n",
            " 26 3918M   26 1039M    0     0  52.4M      0  0:01:14  0:00:19  0:00:55 61.2M\n",
            " 28 3918M   28 1101M    0     0  52.9M      0  0:01:14  0:00:20  0:00:54 60.1M\n",
            " 29 3918M   29 1156M    0     0  53.0M      0  0:01:13  0:00:21  0:00:52 59.0M\n",
            " 30 3918M   30 1208M    0     0  53.0M      0  0:01:13  0:00:22  0:00:51 57.0M\n",
            " 32 3918M   32 1272M    0     0  53.4M      0  0:01:13  0:00:23  0:00:50 59.9M\n",
            " 34 3918M   34 1337M    0     0  53.9M      0  0:01:12  0:00:24  0:00:48 59.5M\n",
            " 35 3918M   35 1397M    0     0  54.1M      0  0:01:12  0:00:25  0:00:47 59.1M\n",
            " 36 3918M   36 1444M    0     0  53.9M      0  0:01:12  0:00:26  0:00:46 57.6M\n",
            " 38 3918M   38 1506M    0     0  54.1M      0  0:01:12  0:00:27  0:00:45 59.2M\n",
            " 40 3918M   40 1569M    0     0  54.4M      0  0:01:11  0:00:28  0:00:43 59.3M\n",
            " 41 3918M   41 1630M    0     0  54.7M      0  0:01:11  0:00:29  0:00:42 58.7M\n",
            " 43 3918M   43 1697M    0     0  55.1M      0  0:01:11  0:00:30  0:00:41 60.0M\n",
            " 44 3918M   44 1759M    0     0  55.3M      0  0:01:10  0:00:31  0:00:39 63.0M\n",
            " 45 3918M   45 1798M    0     0  54.8M      0  0:01:11  0:00:32  0:00:39 58.6M\n",
            " 47 3918M   47 1866M    0     0  55.2M      0  0:01:10  0:00:33  0:00:37 59.4M\n",
            " 49 3918M   49 1930M    0     0  55.4M      0  0:01:10  0:00:34  0:00:36 59.9M\n",
            " 50 3918M   50 1994M    0     0  55.7M      0  0:01:10  0:00:35  0:00:35 59.4M\n",
            " 52 3918M   52 2062M    0     0  56.0M      0  0:01:09  0:00:36  0:00:33 60.4M\n",
            " 54 3918M   54 2121M    0     0  56.1M      0  0:01:09  0:00:37  0:00:32 64.6M\n",
            " 55 3918M   55 2186M    0     0  56.3M      0  0:01:09  0:00:38  0:00:31 63.9M\n",
            " 57 3918M   57 2248M    0     0  56.5M      0  0:01:09  0:00:39  0:00:30 63.6M\n",
            " 59 3918M   59 2317M    0     0  56.8M      0  0:01:08  0:00:40  0:00:28 64.6M\n",
            " 60 3918M   60 2382M    0     0  57.0M      0  0:01:08  0:00:41  0:00:27 64.1M\n",
            " 62 3918M   62 2435M    0     0  56.8M      0  0:01:08  0:00:42  0:00:26 62.7M\n",
            " 63 3918M   63 2496M    0     0  56.9M      0  0:01:08  0:00:43  0:00:25 62.0M\n",
            " 64 3918M   64 2546M    0     0  56.8M      0  0:01:08  0:00:44  0:00:24 59.5M\n",
            " 66 3918M   66 2596M    0     0  56.6M      0  0:01:09  0:00:45  0:00:24 55.7M\n",
            " 67 3918M   67 2646M    0     0  56.5M      0  0:01:09  0:00:46  0:00:23 52.7M\n",
            " 69 3918M   69 2708M    0     0  56.6M      0  0:01:09  0:00:47  0:00:22 54.7M\n",
            " 70 3918M   70 2773M    0     0  56.8M      0  0:01:08  0:00:48  0:00:20 55.4M\n",
            " 72 3918M   72 2833M    0     0  56.9M      0  0:01:08  0:00:49  0:00:19 57.4M\n",
            " 73 3918M   73 2896M    0     0  56.9M      0  0:01:08  0:00:50  0:00:18 59.7M\n",
            " 75 3918M   75 2952M    0     0  56.9M      0  0:01:08  0:00:51  0:00:17 61.1M\n",
            " 76 3918M   76 3012M    0     0  57.0M      0  0:01:08  0:00:52  0:00:16 60.3M\n",
            " 78 3918M   78 3078M    0     0  57.2M      0  0:01:08  0:00:53  0:00:15 60.9M\n",
            " 80 3918M   80 3140M    0     0  57.3M      0  0:01:08  0:00:54  0:00:14 61.3M\n",
            " 81 3918M   81 3196M    0     0  57.2M      0  0:01:08  0:00:55  0:00:13 60.1M\n",
            " 83 3918M   83 3261M    0     0  57.4M      0  0:01:08  0:00:56  0:00:12 61.7M\n",
            " 84 3918M   84 3324M    0     0  57.5M      0  0:01:08  0:00:57  0:00:11 62.8M\n",
            " 86 3918M   86 3390M    0     0  57.6M      0  0:01:07  0:00:58  0:00:09 62.4M\n",
            " 88 3918M   88 3456M    0     0  57.7M      0  0:01:07  0:00:59  0:00:08 63.1M\n",
            " 89 3918M   89 3520M    0     0  57.9M      0  0:01:07  0:01:00  0:00:07 64.8M\n",
            " 91 3918M   91 3585M    0     0  58.0M      0  0:01:07  0:01:01  0:00:06 64.9M\n",
            " 93 3918M   93 3654M    0     0  58.1M      0  0:01:07  0:01:02  0:00:05 65.8M\n",
            " 94 3918M   94 3719M    0     0  58.2M      0  0:01:07  0:01:03  0:00:04 65.7M\n",
            " 96 3918M   96 3778M    0     0  58.3M      0  0:01:07  0:01:04  0:00:03 64.5M\n",
            " 97 3918M   97 3828M    0     0  58.1M      0  0:01:07  0:01:05  0:00:02 61.6M\n",
            " 99 3918M   99 3888M    0     0  58.2M      0  0:01:07  0:01:06  0:00:01 60.6M\n",
            "100 3918M  100 3918M    0     0  58.2M      0  0:01:07  0:01:07 --:--:-- 58.8M\n",
            "'ls' is not recognized as an internal or external command,\n",
            "operable program or batch file.\n"
          ]
        }
      ],
      "source": [
        "! curl -L \"https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_0.gguf?download=true\" -o ./mistral-7b-instruct-v0.2.Q4_0.gguf\n",
        "! ls mistral-7b-instruct-v0.2.Q4_0.gguf"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "f6842324-0a81-44fb-a220-905af77601af",
      "metadata": {
        "id": "f6842324-0a81-44fb-a220-905af77601af"
      },
      "source": [
        "# Initializing Dynamic Routes\n",
        "\n",
        "Similar to the `02-dynamic-routes.ipynb` notebook, we will be initializing some dynamic routes that make use of LLMs for function calling"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "e26db664-9dff-476a-84ef-edd7a8cdf1ba",
      "metadata": {
        "id": "e26db664-9dff-476a-84ef-edd7a8cdf1ba",
        "outputId": "dd9fe6ce-0ac1-41bb-99d9-7dfa0cbb7840"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "c:\\Users\\Siraj\\Documents\\Personal\\Work\\Aurelio\\Virtual Environments\\semantic_router_3\\Lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
            "  from .autonotebook import tqdm as notebook_tqdm\n"
          ]
        }
      ],
      "source": [
        "from datetime import datetime\n",
        "from zoneinfo import ZoneInfo\n",
        "\n",
        "from semantic_router import Route\n",
        "from semantic_router.utils.function_call import get_schema\n",
        "\n",
        "\n",
        "def get_time(timezone: str) -> str:\n",
        "    \"\"\"Finds the current time in a specific timezone.\n",
        "\n",
        "    :param timezone: The timezone to find the current time in, should\n",
        "        be a valid timezone from the IANA Time Zone Database like\n",
        "        \"America/New_York\" or \"Europe/London\". Do NOT put the place\n",
        "        name itself like \"rome\", or \"new york\", you must provide\n",
        "        the IANA format.\n",
        "    :type timezone: str\n",
        "    :return: The current time in the specified timezone.\"\"\"\n",
        "    now = datetime.now(ZoneInfo(timezone))\n",
        "    return now.strftime(\"%H:%M\")\n",
        "\n",
        "\n",
        "time_schema = get_schema(get_time)\n",
        "time_schema\n",
        "time = Route(\n",
        "    name=\"get_time\",\n",
        "    utterances=[\n",
        "        \"what is the time in new york city?\",\n",
        "        \"what is the time in london?\",\n",
        "        \"I live in Rome, what time is it?\",\n",
        "    ],\n",
        "    function_schemas=[time_schema],\n",
        ")\n",
        "\n",
        "politics = Route(\n",
        "    name=\"politics\",\n",
        "    utterances=[\n",
        "        \"isn't politics the best thing ever\",\n",
        "        \"why don't you tell me about your political opinions\",\n",
        "        \"don't you just love the president\" \"don't you just hate the president\",\n",
        "        \"they're going to destroy this country!\",\n",
        "        \"they will save the country!\",\n",
        "    ],\n",
        ")\n",
        "chitchat = Route(\n",
        "    name=\"chitchat\",\n",
        "    utterances=[\n",
        "        \"how's the weather today?\",\n",
        "        \"how are things going?\",\n",
        "        \"lovely weather today\",\n",
        "        \"the weather is horrendous\",\n",
        "        \"let's go to the chippy\",\n",
        "    ],\n",
        ")\n",
        "\n",
        "routes = [politics, chitchat, time]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "fac95b0c-c61f-4158-b7d9-0221f7d0b65e",
      "metadata": {
        "id": "fac95b0c-c61f-4158-b7d9-0221f7d0b65e",
        "outputId": "9f8ec42a-53d3-4fb0-ad30-26b91481d327"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'name': 'get_time',\n",
              " 'description': 'Finds the current time in a specific timezone.\\n\\n:param timezone: The timezone to find the current time in, should\\n    be a valid timezone from the IANA Time Zone Database like\\n    \"America/New_York\" or \"Europe/London\". Do NOT put the place\\n    name itself like \"rome\", or \"new york\", you must provide\\n    the IANA format.\\n:type timezone: str\\n:return: The current time in the specified timezone.',\n",
              " 'signature': '(timezone: str) -> str',\n",
              " 'output': \"<class 'str'>\"}"
            ]
          },
          "execution_count": 5,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "time_schema"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "ddd15620-92bd-4b77-99f4-c3fe68e9ab62",
      "metadata": {
        "id": "ddd15620-92bd-4b77-99f4-c3fe68e9ab62"
      },
      "source": [
        "# Encoders\n",
        "\n",
        "You can use alternative Encoders, however, in this example we want to showcase a fully-local Semantic Router execution, so we are going to use a `HuggingFaceEncoder` with `sentence-transformers/all-MiniLM-L6-v2` (the default) as an embedding model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "5253c141-141b-4fda-b07c-a313393902ed",
      "metadata": {
        "id": "5253c141-141b-4fda-b07c-a313393902ed"
      },
      "outputs": [],
      "source": [
        "from semantic_router.encoders import HuggingFaceEncoder\n",
        "\n",
        "encoder = HuggingFaceEncoder()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "512fb46e-352b-4740-971e-ad4d047aa03b",
      "metadata": {
        "id": "512fb46e-352b-4740-971e-ad4d047aa03b"
      },
      "source": [
        "# `llama.cpp` LLM\n",
        "\n",
        "From here, we can go ahead and instantiate our `llama-cpp-python` `llama_cpp.Llama` LLM, and then pass it to the `semantic_router.llms.LlamaCppLLM` wrapper class.\n",
        "\n",
        "For `llama_cpp.Llama`, there are a couple of parameters you should pay attention to:\n",
        "\n",
        "- `n_gpu_layers`: how many LLM layers to offload to the GPU (if you want to offload the entire model, pass `-1`, and for CPU execution, pass `0`)\n",
        "- `n_ctx`: context size, limit the number of tokens that can be passed to the LLM (this is bounded by the model's internal maximum context size, in this case for Mistral-7B-Instruct, 8000 tokens)\n",
        "- `verbose`: if `False`, silences output from `llama.cpp`\n",
        "\n",
        "> For other parameter explanation, refer to the `llama-cpp-python` [API Reference](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "772cec0d-7a0c-4c7e-9b7a-4a1864b0a8ec",
      "metadata": {
        "scrolled": true,
        "id": "772cec0d-7a0c-4c7e-9b7a-4a1864b0a8ec",
        "outputId": "ac1e6adb-a08f-490d-ba82-1bfd11a65e72"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from ./mistral-7b-instruct-v0.2.Q4_0.gguf (version GGUF V3 (latest))\n",
            "llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.\n",
            "llama_model_loader: - kv   0:                       general.architecture str              = llama\n",
            "llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2\n",
            "llama_model_loader: - kv   2:                       llama.context_length u32              = 32768\n",
            "llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096\n",
            "llama_model_loader: - kv   4:                          llama.block_count u32              = 32\n",
            "llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336\n",
            "llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128\n",
            "llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32\n",
            "llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8\n",
            "llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010\n",
            "llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000\n",
            "llama_model_loader: - kv  11:                          general.file_type u32              = 2\n",
            "llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama\n",
            "llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = [\"<unk>\", \"<s>\", \"</s>\", \"<0x00>\", \"<...\n",
            "llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...\n",
            "llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...\n",
            "llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1\n",
            "llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2\n",
            "llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0\n",
            "llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0\n",
            "llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true\n",
            "llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false\n",
            "llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...\n",
            "llama_model_loader: - kv  23:               general.quantization_version u32              = 2\n",
            "llama_model_loader: - type  f32:   65 tensors\n",
            "llama_model_loader: - type q4_0:  225 tensors\n",
            "llama_model_loader: - type q6_K:    1 tensors\n",
            "llm_load_vocab: special tokens definition check successful ( 259/32000 ).\n",
            "llm_load_print_meta: format           = GGUF V3 (latest)\n",
            "llm_load_print_meta: arch             = llama\n",
            "llm_load_print_meta: vocab type       = SPM\n",
            "llm_load_print_meta: n_vocab          = 32000\n",
            "llm_load_print_meta: n_merges         = 0\n",
            "llm_load_print_meta: n_ctx_train      = 32768\n",
            "llm_load_print_meta: n_embd           = 4096\n",
            "llm_load_print_meta: n_head           = 32\n",
            "llm_load_print_meta: n_head_kv        = 8\n",
            "llm_load_print_meta: n_layer          = 32\n",
            "llm_load_print_meta: n_rot            = 128\n",
            "llm_load_print_meta: n_embd_head_k    = 128\n",
            "llm_load_print_meta: n_embd_head_v    = 128\n",
            "llm_load_print_meta: n_gqa            = 4\n",
            "llm_load_print_meta: n_embd_k_gqa     = 1024\n",
            "llm_load_print_meta: n_embd_v_gqa     = 1024\n",
            "llm_load_print_meta: f_norm_eps       = 0.0e+00\n",
            "llm_load_print_meta: f_norm_rms_eps   = 1.0e-05\n",
            "llm_load_print_meta: f_clamp_kqv      = 0.0e+00\n",
            "llm_load_print_meta: f_max_alibi_bias = 0.0e+00\n",
            "llm_load_print_meta: f_logit_scale    = 0.0e+00\n",
            "llm_load_print_meta: n_ff             = 14336\n",
            "llm_load_print_meta: n_expert         = 0\n",
            "llm_load_print_meta: n_expert_used    = 0\n",
            "llm_load_print_meta: causal attn      = 1\n",
            "llm_load_print_meta: pooling type     = 0\n",
            "llm_load_print_meta: rope type        = 0\n",
            "llm_load_print_meta: rope scaling     = linear\n",
            "llm_load_print_meta: freq_base_train  = 1000000.0\n",
            "llm_load_print_meta: freq_scale_train = 1\n",
            "llm_load_print_meta: n_yarn_orig_ctx  = 32768\n",
            "llm_load_print_meta: rope_finetuned   = unknown\n",
            "llm_load_print_meta: ssm_d_conv       = 0\n",
            "llm_load_print_meta: ssm_d_inner      = 0\n",
            "llm_load_print_meta: ssm_d_state      = 0\n",
            "llm_load_print_meta: ssm_dt_rank      = 0\n",
            "llm_load_print_meta: model type       = 8B\n",
            "llm_load_print_meta: model ftype      = Q4_0\n",
            "llm_load_print_meta: model params     = 7.24 B\n",
            "llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW) \n",
            "llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.2\n",
            "llm_load_print_meta: BOS token        = 1 '<s>'\n",
            "llm_load_print_meta: EOS token        = 2 '</s>'\n",
            "llm_load_print_meta: UNK token        = 0 '<unk>'\n",
            "llm_load_print_meta: PAD token        = 0 '<unk>'\n",
            "llm_load_print_meta: LF token         = 13 '<0x0A>'\n",
            "llm_load_tensors: ggml ctx size =    0.15 MiB\n",
            "llm_load_tensors:        CPU buffer size =  3917.87 MiB\n",
            "..................................................................................................\n",
            "llama_new_context_with_model: n_ctx      = 2048\n",
            "llama_new_context_with_model: n_batch    = 512\n",
            "llama_new_context_with_model: n_ubatch   = 512\n",
            "llama_new_context_with_model: flash_attn = 0\n",
            "llama_new_context_with_model: freq_base  = 1000000.0\n",
            "llama_new_context_with_model: freq_scale = 1\n",
            "llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB\n",
            "llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB\n",
            "llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB\n",
            "llama_new_context_with_model:        CPU compute buffer size =   164.01 MiB\n",
            "llama_new_context_with_model: graph nodes  = 1030\n",
            "llama_new_context_with_model: graph splits = 1\n",
            "AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | \n",
            "Model metadata: {'general.name': 'mistralai_mistral-7b-instruct-v0.2', 'general.architecture': 'llama', 'llama.context_length': '32768', 'llama.rope.dimension_count': '128', 'llama.embedding_length': '4096', 'llama.block_count': '32', 'llama.feed_forward_length': '14336', 'llama.attention.head_count': '32', 'tokenizer.ggml.eos_token_id': '2', 'general.file_type': '2', 'llama.attention.head_count_kv': '8', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.freq_base': '1000000.000000', 'tokenizer.ggml.model': 'llama', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.add_bos_token': 'true', 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.chat_template': \"{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}\"}\n",
            "Guessed chat format: mistral-instruct\n",
            "\u001b[32m2024-05-07 15:50:07 INFO semantic_router.utils.logger local\u001b[0m\n"
          ]
        }
      ],
      "source": [
        "from semantic_router import RouteLayer\n",
        "\n",
        "from llama_cpp import Llama\n",
        "from semantic_router.llms.llamacpp import LlamaCppLLM\n",
        "\n",
        "enable_gpu = True  # offload LLM layers to the GPU (must fit in memory)\n",
        "\n",
        "_llm = Llama(\n",
        "    model_path=\"./mistral-7b-instruct-v0.2.Q4_0.gguf\",\n",
        "    n_gpu_layers=-1 if enable_gpu else 0,\n",
        "    n_ctx=2048,\n",
        ")\n",
        "_llm.verbose = False\n",
        "llm = LlamaCppLLM(name=\"Mistral-7B-v0.2-Instruct\", llm=_llm, max_tokens=None)\n",
        "\n",
        "rl = RouteLayer(encoder=encoder, routes=routes, llm=llm)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "a8bd1da4-8ff7-4cd3-a5e3-fd79a938cc67",
      "metadata": {
        "id": "a8bd1da4-8ff7-4cd3-a5e3-fd79a938cc67",
        "outputId": "8f18c938-067b-4bdc-a2a2-299b346d7d54"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "RouteChoice(name='chitchat', function_call=None, similarity_score=None)"
            ]
          },
          "execution_count": 8,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "rl(\"how's the weather today?\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "c6ccbea2-376b-4b28-9b79-d2e9c71e99f4",
      "metadata": {
        "scrolled": true,
        "id": "c6ccbea2-376b-4b28-9b79-d2e9c71e99f4",
        "outputId": "dcd7a550-3ad8-4814-c376-9a40d6262ef7"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "from_string grammar:\n",
            "root ::= object \n",
            "object ::= [{] ws object_11 [}] ws \n",
            "value ::= object | array | string | number | value_6 ws \n",
            "array ::= [[] ws array_15 []] ws \n",
            "string ::= [\"] string_18 [\"] ws \n",
            "number ::= number_19 number_25 number_29 ws \n",
            "value_6 ::= [t] [r] [u] [e] | [f] [a] [l] [s] [e] | [n] [u] [l] [l] \n",
            "ws ::= ws_31 \n",
            "object_8 ::= string [:] ws value object_10 \n",
            "object_9 ::= [,] ws string [:] ws value \n",
            "object_10 ::= object_9 object_10 | \n",
            "object_11 ::= object_8 | \n",
            "array_12 ::= value array_14 \n",
            "array_13 ::= [,] ws value \n",
            "array_14 ::= array_13 array_14 | \n",
            "array_15 ::= array_12 | \n",
            "string_16 ::= [^\"\\] | [\\] string_17 \n",
            "string_17 ::= [\"\\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] \n",
            "string_18 ::= string_16 string_18 | \n",
            "number_19 ::= number_20 number_21 \n",
            "number_20 ::= [-] | \n",
            "number_21 ::= [0-9] | [1-9] number_22 \n",
            "number_22 ::= [0-9] number_22 | \n",
            "number_23 ::= [.] number_24 \n",
            "number_24 ::= [0-9] number_24 | [0-9] \n",
            "number_25 ::= number_23 | \n",
            "number_26 ::= [eE] number_27 number_28 \n",
            "number_27 ::= [-+] | \n",
            "number_28 ::= [0-9] number_28 | [0-9] \n",
            "number_29 ::= number_26 | \n",
            "ws_30 ::= [ <U+0009><U+000A>] ws \n",
            "ws_31 ::= ws_30 | \n",
            "\n",
            "\u001b[32m2024-05-07 15:50:08 INFO semantic_router.utils.logger Extracting function input...\u001b[0m\n",
            "\u001b[32m2024-05-07 15:50:59 INFO semantic_router.utils.logger LLM output: {\n",
            "\t\"timezone\": \"America/New_York\"\n",
            "}\u001b[0m\n",
            "\u001b[32m2024-05-07 15:50:59 INFO semantic_router.utils.logger Function inputs: [{'timezone': 'America/New_York'}]\u001b[0m\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "name='get_time' function_call=[{'timezone': 'America/New_York'}] similarity_score=None\n"
          ]
        },
        {
          "data": {
            "text/plain": [
              "'07:50'"
            ]
          },
          "execution_count": 9,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "out = rl(\"what's the time in New York right now?\")\n",
        "print(out)\n",
        "get_time(**out.function_call[0])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "720f976a",
      "metadata": {
        "id": "720f976a",
        "outputId": "b78d788f-8bd7-471d-c6b9-be23a2849762"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "from_string grammar:\n",
            "root ::= object \n",
            "object ::= [{] ws object_11 [}] ws \n",
            "value ::= object | array | string | number | value_6 ws \n",
            "array ::= [[] ws array_15 []] ws \n",
            "string ::= [\"] string_18 [\"] ws \n",
            "number ::= number_19 number_25 number_29 ws \n",
            "value_6 ::= [t] [r] [u] [e] | [f] [a] [l] [s] [e] | [n] [u] [l] [l] \n",
            "ws ::= ws_31 \n",
            "object_8 ::= string [:] ws value object_10 \n",
            "object_9 ::= [,] ws string [:] ws value \n",
            "object_10 ::= object_9 object_10 | \n",
            "object_11 ::= object_8 | \n",
            "array_12 ::= value array_14 \n",
            "array_13 ::= [,] ws value \n",
            "array_14 ::= array_13 array_14 | \n",
            "array_15 ::= array_12 | \n",
            "string_16 ::= [^\"\\] | [\\] string_17 \n",
            "string_17 ::= [\"\\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] \n",
            "string_18 ::= string_16 string_18 | \n",
            "number_19 ::= number_20 number_21 \n",
            "number_20 ::= [-] | \n",
            "number_21 ::= [0-9] | [1-9] number_22 \n",
            "number_22 ::= [0-9] number_22 | \n",
            "number_23 ::= [.] number_24 \n",
            "number_24 ::= [0-9] number_24 | [0-9] \n",
            "number_25 ::= number_23 | \n",
            "number_26 ::= [eE] number_27 number_28 \n",
            "number_27 ::= [-+] | \n",
            "number_28 ::= [0-9] number_28 | [0-9] \n",
            "number_29 ::= number_26 | \n",
            "ws_30 ::= [ <U+0009><U+000A>] ws \n",
            "ws_31 ::= ws_30 | \n",
            "\n",
            "\u001b[32m2024-05-07 15:50:59 INFO semantic_router.utils.logger Extracting function input...\u001b[0m\n",
            "\u001b[32m2024-05-07 15:51:27 INFO semantic_router.utils.logger LLM output: {\"timezone\": \"Europe/Rome\"}\u001b[0m\n",
            "\u001b[32m2024-05-07 15:51:27 INFO semantic_router.utils.logger Function inputs: [{'timezone': 'Europe/Rome'}]\u001b[0m\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "name='get_time' function_call=[{'timezone': 'Europe/Rome'}] similarity_score=None\n"
          ]
        },
        {
          "data": {
            "text/plain": [
              "'13:51'"
            ]
          },
          "execution_count": 10,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "out = rl(\"what's the time in Rome right now?\")\n",
        "print(out)\n",
        "get_time(**out.function_call[0])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "c9d9dbbb",
      "metadata": {
        "id": "c9d9dbbb",
        "outputId": "0813be04-ab70-4e1a-b018-7c99df77d74f"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "from_string grammar:\n",
            "root ::= object \n",
            "object ::= [{] ws object_11 [}] ws \n",
            "value ::= object | array | string | number | value_6 ws \n",
            "array ::= [[] ws array_15 []] ws \n",
            "string ::= [\"] string_18 [\"] ws \n",
            "number ::= number_19 number_25 number_29 ws \n",
            "value_6 ::= [t] [r] [u] [e] | [f] [a] [l] [s] [e] | [n] [u] [l] [l] \n",
            "ws ::= ws_31 \n",
            "object_8 ::= string [:] ws value object_10 \n",
            "object_9 ::= [,] ws string [:] ws value \n",
            "object_10 ::= object_9 object_10 | \n",
            "object_11 ::= object_8 | \n",
            "array_12 ::= value array_14 \n",
            "array_13 ::= [,] ws value \n",
            "array_14 ::= array_13 array_14 | \n",
            "array_15 ::= array_12 | \n",
            "string_16 ::= [^\"\\] | [\\] string_17 \n",
            "string_17 ::= [\"\\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] \n",
            "string_18 ::= string_16 string_18 | \n",
            "number_19 ::= number_20 number_21 \n",
            "number_20 ::= [-] | \n",
            "number_21 ::= [0-9] | [1-9] number_22 \n",
            "number_22 ::= [0-9] number_22 | \n",
            "number_23 ::= [.] number_24 \n",
            "number_24 ::= [0-9] number_24 | [0-9] \n",
            "number_25 ::= number_23 | \n",
            "number_26 ::= [eE] number_27 number_28 \n",
            "number_27 ::= [-+] | \n",
            "number_28 ::= [0-9] number_28 | [0-9] \n",
            "number_29 ::= number_26 | \n",
            "ws_30 ::= [ <U+0009><U+000A>] ws \n",
            "ws_31 ::= ws_30 | \n",
            "\n",
            "\u001b[32m2024-05-07 15:51:27 INFO semantic_router.utils.logger Extracting function input...\u001b[0m\n",
            "\u001b[32m2024-05-07 15:51:56 INFO semantic_router.utils.logger LLM output: {\"timezone\": \"Asia/Bangkok\"}\u001b[0m\n",
            "\u001b[32m2024-05-07 15:51:56 INFO semantic_router.utils.logger Function inputs: [{'timezone': 'Asia/Bangkok'}]\u001b[0m\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "name='get_time' function_call=[{'timezone': 'Asia/Bangkok'}] similarity_score=None\n"
          ]
        },
        {
          "data": {
            "text/plain": [
              "'18:51'"
            ]
          },
          "execution_count": 11,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "out = rl(\"what's the time in Bangkok right now?\")\n",
        "print(out)\n",
        "get_time(**out.function_call[0])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "675d12fd",
      "metadata": {
        "id": "675d12fd",
        "outputId": "f588df23-ea7a-4ada-d6ac-f5fae6af1858"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "from_string grammar:\n",
            "root ::= object \n",
            "object ::= [{] ws object_11 [}] ws \n",
            "value ::= object | array | string | number | value_6 ws \n",
            "array ::= [[] ws array_15 []] ws \n",
            "string ::= [\"] string_18 [\"] ws \n",
            "number ::= number_19 number_25 number_29 ws \n",
            "value_6 ::= [t] [r] [u] [e] | [f] [a] [l] [s] [e] | [n] [u] [l] [l] \n",
            "ws ::= ws_31 \n",
            "object_8 ::= string [:] ws value object_10 \n",
            "object_9 ::= [,] ws string [:] ws value \n",
            "object_10 ::= object_9 object_10 | \n",
            "object_11 ::= object_8 | \n",
            "array_12 ::= value array_14 \n",
            "array_13 ::= [,] ws value \n",
            "array_14 ::= array_13 array_14 | \n",
            "array_15 ::= array_12 | \n",
            "string_16 ::= [^\"\\] | [\\] string_17 \n",
            "string_17 ::= [\"\\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] \n",
            "string_18 ::= string_16 string_18 | \n",
            "number_19 ::= number_20 number_21 \n",
            "number_20 ::= [-] | \n",
            "number_21 ::= [0-9] | [1-9] number_22 \n",
            "number_22 ::= [0-9] number_22 | \n",
            "number_23 ::= [.] number_24 \n",
            "number_24 ::= [0-9] number_24 | [0-9] \n",
            "number_25 ::= number_23 | \n",
            "number_26 ::= [eE] number_27 number_28 \n",
            "number_27 ::= [-+] | \n",
            "number_28 ::= [0-9] number_28 | [0-9] \n",
            "number_29 ::= number_26 | \n",
            "ws_30 ::= [ <U+0009><U+000A>] ws \n",
            "ws_31 ::= ws_30 | \n",
            "\n",
            "\u001b[32m2024-05-07 15:51:56 INFO semantic_router.utils.logger Extracting function input...\u001b[0m\n",
            "\u001b[32m2024-05-07 15:52:25 INFO semantic_router.utils.logger LLM output: {\n",
            "\t\"timezone\": \"Asia/Bangkok\"\n",
            "}\u001b[0m\n",
            "\u001b[32m2024-05-07 15:52:25 INFO semantic_router.utils.logger Function inputs: [{'timezone': 'Asia/Bangkok'}]\u001b[0m\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "name='get_time' function_call=[{'timezone': 'Asia/Bangkok'}] similarity_score=None\n"
          ]
        },
        {
          "data": {
            "text/plain": [
              "'18:52'"
            ]
          },
          "execution_count": 12,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "out = rl(\"what's the time in Phuket right now?\")\n",
        "print(out)\n",
        "get_time(**out.function_call[0])"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "5200f550-f3be-43d7-9b76-6390360f07c8",
      "metadata": {
        "id": "5200f550-f3be-43d7-9b76-6390360f07c8"
      },
      "source": [
        "## Cleanup"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "76df5f53",
      "metadata": {
        "id": "76df5f53"
      },
      "source": [
        "Once done, if you'd like to delete the downloaded model you can do so with the following:\n",
        "\n",
        "```\n",
        "! rm ./mistral-7b-instruct-v0.2.Q4_0.gguf\n",
        "```"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.4"
    },
    "colab": {
      "provenance": []
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}