diff --git a/docs/05-local-execution.ipynb b/docs/05-local-execution.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..ce36bccb8a88ca7cd728f1263f351098456c805e
--- /dev/null
+++ b/docs/05-local-execution.ipynb
@@ -0,0 +1,699 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ee50410e-3f98-4d9c-8838-b38aebd6ce77",
+   "metadata": {},
+   "source": [
+    "# Local execution with `llama.cpp` and HuggingFace Encoder\n",
+    "\n",
+    "There are many reasons users might choose to roll their own LLMs rather than use a third-party service. Whether it's due to cost, privacy or compliance, Semantic Router supports the use of \"local\" LLMs through `llama.cpp`.\n",
+    "\n",
+    "Using `llama.cpp` also enables the use of quantized GGUF models, reducing the memory footprint of deployed models, allowing even 13-billion parameter models to run with hardware acceleration on an Apple M1 Pro chip.\n",
+    "\n",
+    "Below is an example of using semantic router with **Mistral-7B-Instruct**, quantized i."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "baa8d577-9f23-4dec-b167-fdecfb313c52",
+   "metadata": {},
+   "source": [
+    "## Installing the library\n",
+    "\n",
+    "> Note: if you require hardware acceleration via BLAS, CUDA, Metal, etc. please refer to the [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python#installation-with-specific-hardware-acceleration-blas-cuda-metal-etc) repository README.md"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f95e4906-c3e6-4905-8f13-5e67d67069d5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -qU \"semantic-router[local]==0.0.16\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0029cc6d",
+   "metadata": {},
+   "source": [
+    "If you're running on Apple silicon you can run the following to run with Metal hardware acceleration:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "4f9b5729",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!CMAKE_ARGS=\"-DLLAMA_METAL=on\""
+   ]
+  },
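+  {
+   "cell_type": "markdown",
+   "id": "9a2e4c1d",
+   "metadata": {},
+   "source": [
+    "If you're on an NVIDIA GPU machine instead, the `llama-cpp-python` README documents an equivalent cuBLAS build flag (at the time of writing, `CMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" pip install -qU --force-reinstall --no-cache-dir llama-cpp-python`); see the repository README linked above for the full list of acceleration backends."
+   ]
+  },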
+  {
+   "cell_type": "markdown",
+   "id": "d2f52f11-ae6d-4706-8da3-ce03a7a6b92d",
+   "metadata": {},
+   "source": [
+    "## Download the Mistral 7B Instruct 4-bit GGUF files\n",
+    "\n",
+    "We will be using Mistral 7B Instruct, quantized as a 4-bit GGUF file, a good balance between performance and ability to deploy on consumer hardware"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1d6ddf61-c189-4b3b-99df-9508f830ae1f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! curl -L \"https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_0.gguf?download=true\" -o ./mistral-7b-instruct-v0.2.Q4_0.gguf\n",
+    "! ls mistral-7b-instruct-v0.2.Q4_0.gguf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f6842324-0a81-44fb-a220-905af77601af",
+   "metadata": {},
+   "source": [
+    "# Initializing Dynamic Routes\n",
+    "\n",
+    "Similar to the `02-dynamic-routes.ipynb` notebook, we will be initializing some dynamic routes that make use of LLMs for function calling"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "e26db664-9dff-476a-84ef-edd7a8cdf1ba",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datetime import datetime\n",
+    "from zoneinfo import ZoneInfo\n",
+    "\n",
+    "from semantic_router import Route\n",
+    "from semantic_router.utils.function_call import get_schema\n",
+    "\n",
+    "\n",
+    "def get_time(timezone: str) -> str:\n",
+    "    \"\"\"Finds the current time in a specific timezone.\n",
+    "\n",
+    "    :param timezone: The timezone to find the current time in, should\n",
+    "        be a valid timezone from the IANA Time Zone Database like\n",
+    "        \"America/New_York\" or \"Europe/London\". Do NOT put the place\n",
+    "        name itself like \"rome\", or \"new york\", you must provide\n",
+    "        the IANA format.\n",
+    "    :type timezone: str\n",
+    "    :return: The current time in the specified timezone.\"\"\"\n",
+    "    now = datetime.now(ZoneInfo(timezone))\n",
+    "    return now.strftime(\"%H:%M\")\n",
+    "\n",
+    "\n",
+    "time_schema = get_schema(get_time)\n",
+    "time_schema\n",
+    "time = Route(\n",
+    "    name=\"get_time\",\n",
+    "    utterances=[\n",
+    "        \"what is the time in new york city?\",\n",
+    "        \"what is the time in london?\",\n",
+    "        \"I live in Rome, what time is it?\",\n",
+    "    ],\n",
+    "    function_schema=time_schema,\n",
+    ")\n",
+    "\n",
+    "politics = Route(\n",
+    "    name=\"politics\",\n",
+    "    utterances=[\n",
+    "        \"isn't politics the best thing ever\",\n",
+    "        \"why don't you tell me about your political opinions\",\n",
+    "        \"don't you just love the president\" \"don't you just hate the president\",\n",
+    "        \"they're going to destroy this country!\",\n",
+    "        \"they will save the country!\",\n",
+    "    ],\n",
+    ")\n",
+    "chitchat = Route(\n",
+    "    name=\"chitchat\",\n",
+    "    utterances=[\n",
+    "        \"how's the weather today?\",\n",
+    "        \"how are things going?\",\n",
+    "        \"lovely weather today\",\n",
+    "        \"the weather is horrendous\",\n",
+    "        \"let's go to the chippy\",\n",
+    "    ],\n",
+    ")\n",
+    "\n",
+    "routes = [politics, chitchat, time]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "fac95b0c-c61f-4158-b7d9-0221f7d0b65e",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'name': 'get_time',\n",
+       " 'description': 'Finds the current time in a specific timezone.\\n\\n:param timezone: The timezone to find the current time in, should\\n    be a valid timezone from the IANA Time Zone Database like\\n    \"America/New_York\" or \"Europe/London\". Do NOT put the place\\n    name itself like \"rome\", or \"new york\", you must provide\\n    the IANA format.\\n:type timezone: str\\n:return: The current time in the specified timezone.',\n",
+       " 'signature': '(timezone: str) -> str',\n",
+       " 'output': \"<class 'str'>\"}"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "time_schema"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ddd15620-92bd-4b77-99f4-c3fe68e9ab62",
+   "metadata": {},
+   "source": [
+    "# Encoders\n",
+    "\n",
+    "You can use alternative Encoders, however, in this example we want to showcase a fully-local Semantic Router execution, so we are going to use a `HuggingFaceEncoder` with `sentence-transformers/all-MiniLM-L6-v2` (the default) as an embedding model."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "5253c141-141b-4fda-b07c-a313393902ed",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/Users/jamesbriggs/opt/anaconda3/envs/decision-layer/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+      "  from .autonotebook import tqdm as notebook_tqdm\n"
+     ]
+    }
+   ],
+   "source": [
+    "from semantic_router.encoders import HuggingFaceEncoder\n",
+    "\n",
+    "encoder = HuggingFaceEncoder()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "512fb46e-352b-4740-971e-ad4d047aa03b",
+   "metadata": {},
+   "source": [
+    "# `llama.cpp` LLM\n",
+    "\n",
+    "From here, we can go ahead and instantiate our `llama-cpp-python` `llama_cpp.Llama` LLM, and then pass it to the `semantic_router.llms.LlamaCppLLM` wrapper class.\n",
+    "\n",
+    "For `llama_cpp.Llama`, there are a couple of parameters you should pay attention to:\n",
+    "\n",
+    "- `n_gpu_layers`: how many LLM layers to offload to the GPU (if you want to offload the entire model, pass `-1`, and for CPU execution, pass `0`)\n",
+    "- `n_ctx`: context size, limit the number of tokens that can be passed to the LLM (this is bounded by the model's internal maximum context size, in this case for Mistral-7B-Instruct, 8000 tokens)\n",
+    "- `verbose`: if `False`, silences output from `llama.cpp`\n",
+    "\n",
+    "> For other parameter explanation, refer to the `llama-cpp-python` [API Reference](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "772cec0d-7a0c-4c7e-9b7a-4a1864b0a8ec",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from ./mistral-7b-instruct-v0.2.Q4_0.gguf (version GGUF V3 (latest))\n",
+      "llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.\n",
+      "llama_model_loader: - kv   0:                       general.architecture str              = llama\n",
+      "llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2\n",
+      "llama_model_loader: - kv   2:                       llama.context_length u32              = 32768\n",
+      "llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096\n",
+      "llama_model_loader: - kv   4:                          llama.block_count u32              = 32\n",
+      "llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336\n",
+      "llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128\n",
+      "llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32\n",
+      "llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8\n",
+      "llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010\n",
+      "llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000\n",
+      "llama_model_loader: - kv  11:                          general.file_type u32              = 2\n",
+      "llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama\n",
+      "llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = [\"<unk>\", \"<s>\", \"</s>\", \"<0x00>\", \"<...\n",
+      "llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...\n",
+      "llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...\n",
+      "llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1\n",
+      "llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2\n",
+      "llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0\n",
+      "llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0\n",
+      "llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true\n",
+      "llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false\n",
+      "llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...\n",
+      "llama_model_loader: - kv  23:               general.quantization_version u32              = 2\n",
+      "llama_model_loader: - type  f32:   65 tensors\n",
+      "llama_model_loader: - type q4_0:  225 tensors\n",
+      "llama_model_loader: - type q6_K:    1 tensors\n",
+      "llm_load_vocab: special tokens definition check successful ( 259/32000 ).\n",
+      "llm_load_print_meta: format           = GGUF V3 (latest)\n",
+      "llm_load_print_meta: arch             = llama\n",
+      "llm_load_print_meta: vocab type       = SPM\n",
+      "llm_load_print_meta: n_vocab          = 32000\n",
+      "llm_load_print_meta: n_merges         = 0\n",
+      "llm_load_print_meta: n_ctx_train      = 32768\n",
+      "llm_load_print_meta: n_embd           = 4096\n",
+      "llm_load_print_meta: n_head           = 32\n",
+      "llm_load_print_meta: n_head_kv        = 8\n",
+      "llm_load_print_meta: n_layer          = 32\n",
+      "llm_load_print_meta: n_rot            = 128\n",
+      "llm_load_print_meta: n_embd_head_k    = 128\n",
+      "llm_load_print_meta: n_embd_head_v    = 128\n",
+      "llm_load_print_meta: n_gqa            = 4\n",
+      "llm_load_print_meta: n_embd_k_gqa     = 1024\n",
+      "llm_load_print_meta: n_embd_v_gqa     = 1024\n",
+      "llm_load_print_meta: f_norm_eps       = 0.0e+00\n",
+      "llm_load_print_meta: f_norm_rms_eps   = 1.0e-05\n",
+      "llm_load_print_meta: f_clamp_kqv      = 0.0e+00\n",
+      "llm_load_print_meta: f_max_alibi_bias = 0.0e+00\n",
+      "llm_load_print_meta: n_ff             = 14336\n",
+      "llm_load_print_meta: n_expert         = 0\n",
+      "llm_load_print_meta: n_expert_used    = 0\n",
+      "llm_load_print_meta: rope scaling     = linear\n",
+      "llm_load_print_meta: freq_base_train  = 1000000.0\n",
+      "llm_load_print_meta: freq_scale_train = 1\n",
+      "llm_load_print_meta: n_yarn_orig_ctx  = 32768\n",
+      "llm_load_print_meta: rope_finetuned   = unknown\n",
+      "llm_load_print_meta: model type       = 7B\n",
+      "llm_load_print_meta: model ftype      = Q4_0\n",
+      "llm_load_print_meta: model params     = 7.24 B\n",
+      "llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW) \n",
+      "llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.2\n",
+      "llm_load_print_meta: BOS token        = 1 '<s>'\n",
+      "llm_load_print_meta: EOS token        = 2 '</s>'\n",
+      "llm_load_print_meta: UNK token        = 0 '<unk>'\n",
+      "llm_load_print_meta: PAD token        = 0 '<unk>'\n",
+      "llm_load_print_meta: LF token         = 13 '<0x0A>'\n",
+      "llm_load_tensors: ggml ctx size       =    0.11 MiB\n",
+      "ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  3918.58 MiB, ( 3918.64 / 21845.34)\n",
+      "llm_load_tensors: system memory used  = 3917.98 MiB\n",
+      "..................................................................................................\n",
+      "llama_new_context_with_model: n_ctx      = 2048\n",
+      "llama_new_context_with_model: freq_base  = 1000000.0\n",
+      "llama_new_context_with_model: freq_scale = 1\n",
+      "ggml_metal_init: allocating\n",
+      "ggml_metal_init: found device: Apple M1 Max\n",
+      "ggml_metal_init: picking default device: Apple M1 Max\n",
+      "ggml_metal_init: default.metallib not found, loading from source\n",
+      "ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil\n",
+      "ggml_metal_init: loading '/Users/jamesbriggs/opt/anaconda3/envs/decision-layer/lib/python3.11/site-packages/llama_cpp/ggml-metal.metal'\n",
+      "ggml_metal_init: GPU name:   Apple M1 Max\n",
+      "ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)\n",
+      "ggml_metal_init: hasUnifiedMemory              = true\n",
+      "ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB\n",
+      "ggml_metal_init: maxTransferRate               = built-in GPU\n",
+      "ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   256.00 MiB, ( 4176.20 / 21845.34)\n",
+      "llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB\n",
+      "ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, ( 4176.22 / 21845.34)\n",
+      "llama_build_graph: non-view tensors processed: 676/676\n",
+      "llama_new_context_with_model: compute buffer total size = 159.19 MiB\n",
+      "ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   156.02 MiB, ( 4332.22 / 21845.34)\n",
+      "\u001b[32m2024-01-13 16:40:52 INFO semantic_router.utils.logger Initializing RouteLayer\u001b[0m\n"
+     ]
+    }
+   ],
+   "source": [
+    "from semantic_router import RouteLayer\n",
+    "\n",
+    "from llama_cpp import Llama\n",
+    "from semantic_router.llms import LlamaCppLLM\n",
+    "\n",
+    "enable_gpu = True  # offload LLM layers to the GPU (must fit in memory)\n",
+    "\n",
+    "_llm = Llama(\n",
+    "    model_path=\"./mistral-7b-instruct-v0.2.Q4_0.gguf\",\n",
+    "    n_gpu_layers=-1 if enable_gpu else 0,\n",
+    "    n_ctx=2048,\n",
+    "    verbose=False,\n",
+    ")\n",
+    "llm = LlamaCppLLM(name=\"Mistral-7B-v0.2-Instruct\", llm=_llm, max_tokens=None)\n",
+    "\n",
+    "rl = RouteLayer(encoder=encoder, routes=routes, llm=llm)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "a8bd1da4-8ff7-4cd3-a5e3-fd79a938cc67",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "RouteChoice(name='chitchat', function_call=None, similarity_score=None, trigger=None)"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "rl(\"how's the weather today?\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "c6ccbea2-376b-4b28-9b79-d2e9c71e99f4",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "from_string grammar:\n",
+      "root ::= object \n",
+      "object ::= [{] ws object_11 [}] ws \n",
+      "value ::= object | array | string | number | value_6 ws \n",
+      "array ::= [[] ws array_15 []] ws \n",
+      "string ::= [\"] string_18 [\"] ws \n",
+      "number ::= number_19 number_25 number_29 ws \n",
+      "value_6 ::= [t] [r] [u] [e] | [f] [a] [l] [s] [e] | [n] [u] [l] [l] \n",
+      "ws ::= ws_31 \n",
+      "object_8 ::= string [:] ws value object_10 \n",
+      "object_9 ::= [,] ws string [:] ws value \n",
+      "object_10 ::= object_9 object_10 | \n",
+      "object_11 ::= object_8 | \n",
+      "array_12 ::= value array_14 \n",
+      "array_13 ::= [,] ws value \n",
+      "array_14 ::= array_13 array_14 | \n",
+      "array_15 ::= array_12 | \n",
+      "string_16 ::= [^\"\\] | [\\] string_17 \n",
+      "string_17 ::= [\"\\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] \n",
+      "string_18 ::= string_16 string_18 | \n",
+      "number_19 ::= number_20 number_21 \n",
+      "number_20 ::= [-] | \n",
+      "number_21 ::= [0-9] | [1-9] number_22 \n",
+      "number_22 ::= [0-9] number_22 | \n",
+      "number_23 ::= [.] number_24 \n",
+      "number_24 ::= [0-9] number_24 | [0-9] \n",
+      "number_25 ::= number_23 | \n",
+      "number_26 ::= [eE] number_27 number_28 \n",
+      "number_27 ::= [-+] | \n",
+      "number_28 ::= [0-9] number_28 | [0-9] \n",
+      "number_29 ::= number_26 | \n",
+      "ws_30 ::= [ <U+0009><U+000A>] ws \n",
+      "ws_31 ::= ws_30 | \n",
+      "\n",
+      "\u001b[32m2024-01-13 16:41:01 INFO semantic_router.utils.logger Extracting function input...\u001b[0m\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "name='get_time' function_call={'timezone': 'America/New_York'} similarity_score=None trigger=None\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'11:41'"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "out = rl(\"what's the time in New York right now?\")\n",
+    "print(out)\n",
+    "get_time(**out.function_call)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "720f976a",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "from_string grammar:\n",
+      "root ::= object \n",
+      "object ::= [{] ws object_11 [}] ws \n",
+      "value ::= object | array | string | number | value_6 ws \n",
+      "array ::= [[] ws array_15 []] ws \n",
+      "string ::= [\"] string_18 [\"] ws \n",
+      "number ::= number_19 number_25 number_29 ws \n",
+      "value_6 ::= [t] [r] [u] [e] | [f] [a] [l] [s] [e] | [n] [u] [l] [l] \n",
+      "ws ::= ws_31 \n",
+      "object_8 ::= string [:] ws value object_10 \n",
+      "object_9 ::= [,] ws string [:] ws value \n",
+      "object_10 ::= object_9 object_10 | \n",
+      "object_11 ::= object_8 | \n",
+      "array_12 ::= value array_14 \n",
+      "array_13 ::= [,] ws value \n",
+      "array_14 ::= array_13 array_14 | \n",
+      "array_15 ::= array_12 | \n",
+      "string_16 ::= [^\"\\] | [\\] string_17 \n",
+      "string_17 ::= [\"\\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] \n",
+      "string_18 ::= string_16 string_18 | \n",
+      "number_19 ::= number_20 number_21 \n",
+      "number_20 ::= [-] | \n",
+      "number_21 ::= [0-9] | [1-9] number_22 \n",
+      "number_22 ::= [0-9] number_22 | \n",
+      "number_23 ::= [.] number_24 \n",
+      "number_24 ::= [0-9] number_24 | [0-9] \n",
+      "number_25 ::= number_23 | \n",
+      "number_26 ::= [eE] number_27 number_28 \n",
+      "number_27 ::= [-+] | \n",
+      "number_28 ::= [0-9] number_28 | [0-9] \n",
+      "number_29 ::= number_26 | \n",
+      "ws_30 ::= [ <U+0009><U+000A>] ws \n",
+      "ws_31 ::= ws_30 | \n",
+      "\n",
+      "\u001b[32m2024-01-13 16:41:04 INFO semantic_router.utils.logger Extracting function input...\u001b[0m\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "name='get_time' function_call={'timezone': 'Europe/Rome'} similarity_score=None trigger=None\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'17:41'"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "out = rl(\"what's the time in Rome right now?\")\n",
+    "print(out)\n",
+    "get_time(**out.function_call)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "c9d9dbbb",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "from_string grammar:\n",
+      "root ::= object \n",
+      "object ::= [{] ws object_11 [}] ws \n",
+      "value ::= object | array | string | number | value_6 ws \n",
+      "array ::= [[] ws array_15 []] ws \n",
+      "string ::= [\"] string_18 [\"] ws \n",
+      "number ::= number_19 number_25 number_29 ws \n",
+      "value_6 ::= [t] [r] [u] [e] | [f] [a] [l] [s] [e] | [n] [u] [l] [l] \n",
+      "ws ::= ws_31 \n",
+      "object_8 ::= string [:] ws value object_10 \n",
+      "object_9 ::= [,] ws string [:] ws value \n",
+      "object_10 ::= object_9 object_10 | \n",
+      "object_11 ::= object_8 | \n",
+      "array_12 ::= value array_14 \n",
+      "array_13 ::= [,] ws value \n",
+      "array_14 ::= array_13 array_14 | \n",
+      "array_15 ::= array_12 | \n",
+      "string_16 ::= [^\"\\] | [\\] string_17 \n",
+      "string_17 ::= [\"\\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] \n",
+      "string_18 ::= string_16 string_18 | \n",
+      "number_19 ::= number_20 number_21 \n",
+      "number_20 ::= [-] | \n",
+      "number_21 ::= [0-9] | [1-9] number_22 \n",
+      "number_22 ::= [0-9] number_22 | \n",
+      "number_23 ::= [.] number_24 \n",
+      "number_24 ::= [0-9] number_24 | [0-9] \n",
+      "number_25 ::= number_23 | \n",
+      "number_26 ::= [eE] number_27 number_28 \n",
+      "number_27 ::= [-+] | \n",
+      "number_28 ::= [0-9] number_28 | [0-9] \n",
+      "number_29 ::= number_26 | \n",
+      "ws_30 ::= [ <U+0009><U+000A>] ws \n",
+      "ws_31 ::= ws_30 | \n",
+      "\n",
+      "\u001b[32m2024-01-13 16:41:05 INFO semantic_router.utils.logger Extracting function input...\u001b[0m\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "name='get_time' function_call={'timezone': 'Asia/Bangkok'} similarity_score=None trigger=None\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'23:41'"
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "out = rl(\"what's the time in Bangkok right now?\")\n",
+    "print(out)\n",
+    "get_time(**out.function_call)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "675d12fd",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "from_string grammar:\n",
+      "root ::= object \n",
+      "object ::= [{] ws object_11 [}] ws \n",
+      "value ::= object | array | string | number | value_6 ws \n",
+      "array ::= [[] ws array_15 []] ws \n",
+      "string ::= [\"] string_18 [\"] ws \n",
+      "number ::= number_19 number_25 number_29 ws \n",
+      "value_6 ::= [t] [r] [u] [e] | [f] [a] [l] [s] [e] | [n] [u] [l] [l] \n",
+      "ws ::= ws_31 \n",
+      "object_8 ::= string [:] ws value object_10 \n",
+      "object_9 ::= [,] ws string [:] ws value \n",
+      "object_10 ::= object_9 object_10 | \n",
+      "object_11 ::= object_8 | \n",
+      "array_12 ::= value array_14 \n",
+      "array_13 ::= [,] ws value \n",
+      "array_14 ::= array_13 array_14 | \n",
+      "array_15 ::= array_12 | \n",
+      "string_16 ::= [^\"\\] | [\\] string_17 \n",
+      "string_17 ::= [\"\\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] \n",
+      "string_18 ::= string_16 string_18 | \n",
+      "number_19 ::= number_20 number_21 \n",
+      "number_20 ::= [-] | \n",
+      "number_21 ::= [0-9] | [1-9] number_22 \n",
+      "number_22 ::= [0-9] number_22 | \n",
+      "number_23 ::= [.] number_24 \n",
+      "number_24 ::= [0-9] number_24 | [0-9] \n",
+      "number_25 ::= number_23 | \n",
+      "number_26 ::= [eE] number_27 number_28 \n",
+      "number_27 ::= [-+] | \n",
+      "number_28 ::= [0-9] number_28 | [0-9] \n",
+      "number_29 ::= number_26 | \n",
+      "ws_30 ::= [ <U+0009><U+000A>] ws \n",
+      "ws_31 ::= ws_30 | \n",
+      "\n",
+      "\u001b[32m2024-01-13 16:41:07 INFO semantic_router.utils.logger Extracting function input...\u001b[0m\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "name='get_time' function_call={'timezone': 'Asia/Bangkok'} similarity_score=None trigger=None\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "'23:41'"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "out = rl(\"what's the time in Phuket right now?\")\n",
+    "print(out)\n",
+    "get_time(**out.function_call)"
+   ]
+  },
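+  {
+   "cell_type": "markdown",
+   "id": "b3e4f5a6",
+   "metadata": {},
+   "source": [
+    "As a quick sanity check (left unexecuted here), a query that doesn't resemble any of the routes above should fall through without a route being chosen, i.e. the returned `RouteChoice` should have `name=None`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c7d8e9f0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rl(\"can you recommend a good book on linear algebra?\")"
+   ]
+  },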
+  {
+   "cell_type": "markdown",
+   "id": "5200f550-f3be-43d7-9b76-6390360f07c8",
+   "metadata": {},
+   "source": [
+    "## Cleanup"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "76df5f53",
+   "metadata": {},
+   "source": [
+    "Once done, if you'd like to delete the downloaded model you can do so with the following:\n",
+    "\n",
+    "```\n",
+    "! rm ./mistral-7b-instruct-v0.2.Q4_0.gguf\n",
+    "```"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/poetry.lock b/poetry.lock
index 96a21bbc54879c58239b7afab825ba958c34d213..c57d0a3de83b9cf3583cc72503c0f69c10cebf8e 100644
--- a/poetry.lock
+++ b/poetry.lock
@@ -440,13 +440,13 @@ colorama = {version = "*", markers = "platform_system == \"Windows\""}
 
 [[package]]
 name = "cohere"
-version = "4.41"
+version = "4.42"
 description = "Python SDK for the Cohere API"
 optional = false
 python-versions = ">=3.8,<4.0"
 files = [
-    {file = "cohere-4.41-py3-none-any.whl", hash = "sha256:39470cc412fa96a1c612f522d48d7d86b34b3163a04030cff83ec48ebbaff32f"},
-    {file = "cohere-4.41.tar.gz", hash = "sha256:8509ca196dc038eca81e474d3cd5896da2ea168a4d3c578b4cb6969994be34ef"},
+    {file = "cohere-4.42-py3-none-any.whl", hash = "sha256:47f9355de0b7628314f461ca009fa3460c7edd9fd42d07cb5439321c05ae5ff9"},
+    {file = "cohere-4.42.tar.gz", hash = "sha256:8b1b93be118c5fb236d008df64abc0687cf88b77d1b589ac8cc8cd0d5dadb04b"},
 ]
 
 [package.dependencies]
@@ -624,6 +624,17 @@ files = [
     {file = "decorator-5.1.1.tar.gz", hash = "sha256:637996211036b6385ef91435e4fae22989472f9d571faba8927ba8253acbc330"},
 ]
 
+[[package]]
+name = "diskcache"
+version = "5.6.3"
+description = "Disk Cache -- Disk and file backed persistent cache."
+optional = true
+python-versions = ">=3"
+files = [
+    {file = "diskcache-5.6.3-py3-none-any.whl", hash = "sha256:5e31b2d5fbad117cc363ebaf6b689474db18a1f6438bc82358b024abd4c2ca19"},
+    {file = "diskcache-5.6.3.tar.gz", hash = "sha256:2c3a3fa2743d8535d832ec61c2054a1641f41775aa7c556758a109941e33e4fc"},
+]
+
 [[package]]
 name = "distro"
 version = "1.9.0"
@@ -1125,13 +1136,13 @@ testing = ["Django", "attrs", "colorama", "docopt", "pytest (<7.0.0)"]
 
 [[package]]
 name = "jinja2"
-version = "3.1.2"
+version = "3.1.3"
 description = "A very fast and expressive template engine."
 optional = true
 python-versions = ">=3.7"
 files = [
-    {file = "Jinja2-3.1.2-py3-none-any.whl", hash = "sha256:6088930bfe239f0e6710546ab9c19c9ef35e29792895fed6e6e31a023a182a61"},
-    {file = "Jinja2-3.1.2.tar.gz", hash = "sha256:31351a702a408a9e7595a8fc6150fc3f43bb6bf7e319770cbc0db9df9437e852"},
+    {file = "Jinja2-3.1.3-py3-none-any.whl", hash = "sha256:7d6d50dd97d52cbc355597bd845fabfbac3f551e1f99619e39a35ce8c370b5fa"},
+    {file = "Jinja2-3.1.3.tar.gz", hash = "sha256:ac8bd6544d4bb2c9792bf3a159e80bba8fda7f07e81bc3aed565432d5925ba90"},
 ]
 
 [package.dependencies]
@@ -1194,6 +1205,27 @@ traitlets = ">=5.3"
 docs = ["myst-parser", "pydata-sphinx-theme", "sphinx-autodoc-typehints", "sphinxcontrib-github-alt", "sphinxcontrib-spelling", "traitlets"]
 test = ["ipykernel", "pre-commit", "pytest", "pytest-cov", "pytest-timeout"]
 
+[[package]]
+name = "llama-cpp-python"
+version = "0.2.28"
+description = "Python bindings for the llama.cpp library"
+optional = true
+python-versions = ">=3.8"
+files = [
+    {file = "llama_cpp_python-0.2.28.tar.gz", hash = "sha256:669885d9654fe27ed084061e23b0c2af5fcf5593aa3d5a159864e249f91e6d84"},
+]
+
+[package.dependencies]
+diskcache = ">=5.6.1"
+numpy = ">=1.20.0"
+typing-extensions = ">=4.5.0"
+
+[package.extras]
+all = ["llama_cpp_python[dev,server,test]"]
+dev = ["black (>=23.3.0)", "httpx (>=0.24.1)", "mkdocs (>=1.4.3)", "mkdocs-material (>=9.1.18)", "mkdocstrings[python] (>=0.22.0)", "pytest (>=7.4.0)", "twine (>=4.0.2)"]
+server = ["fastapi (>=0.100.0)", "pydantic-settings (>=2.0.1)", "sse-starlette (>=1.6.1)", "starlette-context (>=0.3.6,<0.4)", "uvicorn (>=0.22.0)"]
+test = ["httpx (>=0.24.1)", "pytest (>=7.4.0)", "scipy (>=1.10)"]
+
 [[package]]
 name = "markupsafe"
 version = "2.1.3"
@@ -1792,13 +1824,13 @@ sympy = "*"
 
 [[package]]
 name = "openai"
-version = "1.7.0"
+version = "1.7.2"
 description = "The official Python library for the openai API"
 optional = false
 python-versions = ">=3.7.1"
 files = [
-    {file = "openai-1.7.0-py3-none-any.whl", hash = "sha256:2282e8e15acb05df79cccba330c025b8e84284c7ec1f3fa31f167a8479066333"},
-    {file = "openai-1.7.0.tar.gz", hash = "sha256:f2a8dcb739e8620c9318a2c6304ea72aebb572ba02fa1d586344405e80d567d3"},
+    {file = "openai-1.7.2-py3-none-any.whl", hash = "sha256:8f41b90a762f5fd9d182b45851041386fed94c8ad240a70abefee61a68e0ef53"},
+    {file = "openai-1.7.2.tar.gz", hash = "sha256:c73c78878258b07f1b468b0602c6591f25a1478f49ecb90b9bd44b7cc80bce73"},
 ]
 
 [package.dependencies]
@@ -1933,22 +1965,22 @@ wcwidth = "*"
 
 [[package]]
 name = "protobuf"
-version = "4.25.1"
+version = "4.25.2"
 description = ""
 optional = true
 python-versions = ">=3.8"
 files = [
-    {file = "protobuf-4.25.1-cp310-abi3-win32.whl", hash = "sha256:193f50a6ab78a970c9b4f148e7c750cfde64f59815e86f686c22e26b4fe01ce7"},
-    {file = "protobuf-4.25.1-cp310-abi3-win_amd64.whl", hash = "sha256:3497c1af9f2526962f09329fd61a36566305e6c72da2590ae0d7d1322818843b"},
-    {file = "protobuf-4.25.1-cp37-abi3-macosx_10_9_universal2.whl", hash = "sha256:0bf384e75b92c42830c0a679b0cd4d6e2b36ae0cf3dbb1e1dfdda48a244f4bcd"},
-    {file = "protobuf-4.25.1-cp37-abi3-manylinux2014_aarch64.whl", hash = "sha256:0f881b589ff449bf0b931a711926e9ddaad3b35089cc039ce1af50b21a4ae8cb"},
-    {file = "protobuf-4.25.1-cp37-abi3-manylinux2014_x86_64.whl", hash = "sha256:ca37bf6a6d0046272c152eea90d2e4ef34593aaa32e8873fc14c16440f22d4b7"},
-    {file = "protobuf-4.25.1-cp38-cp38-win32.whl", hash = "sha256:abc0525ae2689a8000837729eef7883b9391cd6aa7950249dcf5a4ede230d5dd"},
-    {file = "protobuf-4.25.1-cp38-cp38-win_amd64.whl", hash = "sha256:1484f9e692091450e7edf418c939e15bfc8fc68856e36ce399aed6889dae8bb0"},
-    {file = "protobuf-4.25.1-cp39-cp39-win32.whl", hash = "sha256:8bdbeaddaac52d15c6dce38c71b03038ef7772b977847eb6d374fc86636fa510"},
-    {file = "protobuf-4.25.1-cp39-cp39-win_amd64.whl", hash = "sha256:becc576b7e6b553d22cbdf418686ee4daa443d7217999125c045ad56322dda10"},
-    {file = "protobuf-4.25.1-py3-none-any.whl", hash = "sha256:a19731d5e83ae4737bb2a089605e636077ac001d18781b3cf489b9546c7c80d6"},
-    {file = "protobuf-4.25.1.tar.gz", hash = "sha256:57d65074b4f5baa4ab5da1605c02be90ac20c8b40fb137d6a8df9f416b0d0ce2"},
+    {file = "protobuf-4.25.2-cp310-abi3-win32.whl", hash = "sha256:b50c949608682b12efb0b2717f53256f03636af5f60ac0c1d900df6213910fd6"},
+    {file = "protobuf-4.25.2-cp310-abi3-win_amd64.whl", hash = "sha256:8f62574857ee1de9f770baf04dde4165e30b15ad97ba03ceac65f760ff018ac9"},
+    {file = "protobuf-4.25.2-cp37-abi3-macosx_10_9_universal2.whl", hash = "sha256:2db9f8fa64fbdcdc93767d3cf81e0f2aef176284071507e3ede160811502fd3d"},
+    {file = "protobuf-4.25.2-cp37-abi3-manylinux2014_aarch64.whl", hash = "sha256:10894a2885b7175d3984f2be8d9850712c57d5e7587a2410720af8be56cdaf62"},
+    {file = "protobuf-4.25.2-cp37-abi3-manylinux2014_x86_64.whl", hash = "sha256:fc381d1dd0516343f1440019cedf08a7405f791cd49eef4ae1ea06520bc1c020"},
+    {file = "protobuf-4.25.2-cp38-cp38-win32.whl", hash = "sha256:33a1aeef4b1927431d1be780e87b641e322b88d654203a9e9d93f218ee359e61"},
+    {file = "protobuf-4.25.2-cp38-cp38-win_amd64.whl", hash = "sha256:47f3de503fe7c1245f6f03bea7e8d3ec11c6c4a2ea9ef910e3221c8a15516d62"},
+    {file = "protobuf-4.25.2-cp39-cp39-win32.whl", hash = "sha256:5e5c933b4c30a988b52e0b7c02641760a5ba046edc5e43d3b94a74c9fc57c1b3"},
+    {file = "protobuf-4.25.2-cp39-cp39-win_amd64.whl", hash = "sha256:d66a769b8d687df9024f2985d5137a337f957a0916cf5464d1513eee96a63ff0"},
+    {file = "protobuf-4.25.2-py3-none-any.whl", hash = "sha256:a8b7a98d4ce823303145bf3c1a8bdb0f2f4642a414b196f04ad9853ed0c8f830"},
+    {file = "protobuf-4.25.2.tar.gz", hash = "sha256:fe599e175cb347efc8ee524bcd4b902d11f7262c0e569ececcb89995c15f0a5e"},
 ]
 
 [[package]]
@@ -2496,28 +2528,28 @@ use-chardet-on-py3 = ["chardet (>=3.0.2,<6)"]
 
 [[package]]
 name = "ruff"
-version = "0.1.11"
+version = "0.1.13"
 description = "An extremely fast Python linter and code formatter, written in Rust."
 optional = false
 python-versions = ">=3.7"
 files = [
-    {file = "ruff-0.1.11-py3-none-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:a7f772696b4cdc0a3b2e527fc3c7ccc41cdcb98f5c80fdd4f2b8c50eb1458196"},
-    {file = "ruff-0.1.11-py3-none-macosx_10_12_x86_64.whl", hash = "sha256:934832f6ed9b34a7d5feea58972635c2039c7a3b434fe5ba2ce015064cb6e955"},
-    {file = "ruff-0.1.11-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ea0d3e950e394c4b332bcdd112aa566010a9f9c95814844a7468325290aabfd9"},
-    {file = "ruff-0.1.11-py3-none-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:9bd4025b9c5b429a48280785a2b71d479798a69f5c2919e7d274c5f4b32c3607"},
-    {file = "ruff-0.1.11-py3-none-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:e1ad00662305dcb1e987f5ec214d31f7d6a062cae3e74c1cbccef15afd96611d"},
-    {file = "ruff-0.1.11-py3-none-manylinux_2_17_ppc64.manylinux2014_ppc64.whl", hash = "sha256:4b077ce83f47dd6bea1991af08b140e8b8339f0ba8cb9b7a484c30ebab18a23f"},
-    {file = "ruff-0.1.11-py3-none-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:c4a88efecec23c37b11076fe676e15c6cdb1271a38f2b415e381e87fe4517f18"},
-    {file = "ruff-0.1.11-py3-none-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:5b25093dad3b055667730a9b491129c42d45e11cdb7043b702e97125bcec48a1"},
-    {file = "ruff-0.1.11-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:231d8fb11b2cc7c0366a326a66dafc6ad449d7fcdbc268497ee47e1334f66f77"},
-    {file = "ruff-0.1.11-py3-none-musllinux_1_2_aarch64.whl", hash = "sha256:09c415716884950080921dd6237767e52e227e397e2008e2bed410117679975b"},
-    {file = "ruff-0.1.11-py3-none-musllinux_1_2_armv7l.whl", hash = "sha256:0f58948c6d212a6b8d41cd59e349751018797ce1727f961c2fa755ad6208ba45"},
-    {file = "ruff-0.1.11-py3-none-musllinux_1_2_i686.whl", hash = "sha256:190a566c8f766c37074d99640cd9ca3da11d8deae2deae7c9505e68a4a30f740"},
-    {file = "ruff-0.1.11-py3-none-musllinux_1_2_x86_64.whl", hash = "sha256:6464289bd67b2344d2a5d9158d5eb81025258f169e69a46b741b396ffb0cda95"},
-    {file = "ruff-0.1.11-py3-none-win32.whl", hash = "sha256:9b8f397902f92bc2e70fb6bebfa2139008dc72ae5177e66c383fa5426cb0bf2c"},
-    {file = "ruff-0.1.11-py3-none-win_amd64.whl", hash = "sha256:eb85ee287b11f901037a6683b2374bb0ec82928c5cbc984f575d0437979c521a"},
-    {file = "ruff-0.1.11-py3-none-win_arm64.whl", hash = "sha256:97ce4d752f964ba559c7023a86e5f8e97f026d511e48013987623915431c7ea9"},
-    {file = "ruff-0.1.11.tar.gz", hash = "sha256:f9d4d88cb6eeb4dfe20f9f0519bd2eaba8119bde87c3d5065c541dbae2b5a2cb"},
+    {file = "ruff-0.1.13-py3-none-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:e3fd36e0d48aeac672aa850045e784673449ce619afc12823ea7868fcc41d8ba"},
+    {file = "ruff-0.1.13-py3-none-macosx_10_12_x86_64.whl", hash = "sha256:9fb6b3b86450d4ec6a6732f9f60c4406061b6851c4b29f944f8c9d91c3611c7a"},
+    {file = "ruff-0.1.13-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b13ba5d7156daaf3fd08b6b993360a96060500aca7e307d95ecbc5bb47a69296"},
+    {file = "ruff-0.1.13-py3-none-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:9ebb40442f7b531e136d334ef0851412410061e65d61ca8ce90d894a094feb22"},
+    {file = "ruff-0.1.13-py3-none-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:226b517f42d59a543d6383cfe03cccf0091e3e0ed1b856c6824be03d2a75d3b6"},
+    {file = "ruff-0.1.13-py3-none-manylinux_2_17_ppc64.manylinux2014_ppc64.whl", hash = "sha256:5f0312ba1061e9b8c724e9a702d3c8621e3c6e6c2c9bd862550ab2951ac75c16"},
+    {file = "ruff-0.1.13-py3-none-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:2f59bcf5217c661254bd6bc42d65a6fd1a8b80c48763cb5c2293295babd945dd"},
+    {file = "ruff-0.1.13-py3-none-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:e6894b00495e00c27b6ba61af1fc666f17de6140345e5ef27dd6e08fb987259d"},
+    {file = "ruff-0.1.13-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9a1600942485c6e66119da294c6294856b5c86fd6df591ce293e4a4cc8e72989"},
+    {file = "ruff-0.1.13-py3-none-musllinux_1_2_aarch64.whl", hash = "sha256:ee3febce7863e231a467f90e681d3d89210b900d49ce88723ce052c8761be8c7"},
+    {file = "ruff-0.1.13-py3-none-musllinux_1_2_armv7l.whl", hash = "sha256:dcaab50e278ff497ee4d1fe69b29ca0a9a47cd954bb17963628fa417933c6eb1"},
+    {file = "ruff-0.1.13-py3-none-musllinux_1_2_i686.whl", hash = "sha256:f57de973de4edef3ad3044d6a50c02ad9fc2dff0d88587f25f1a48e3f72edf5e"},
+    {file = "ruff-0.1.13-py3-none-musllinux_1_2_x86_64.whl", hash = "sha256:7a36fa90eb12208272a858475ec43ac811ac37e91ef868759770b71bdabe27b6"},
+    {file = "ruff-0.1.13-py3-none-win32.whl", hash = "sha256:a623349a505ff768dad6bd57087e2461be8db58305ebd5577bd0e98631f9ae69"},
+    {file = "ruff-0.1.13-py3-none-win_amd64.whl", hash = "sha256:f988746e3c3982bea7f824c8fa318ce7f538c4dfefec99cd09c8770bd33e6539"},
+    {file = "ruff-0.1.13-py3-none-win_arm64.whl", hash = "sha256:6bbbc3042075871ec17f28864808540a26f0f79a4478c357d3e3d2284e832998"},
+    {file = "ruff-0.1.13.tar.gz", hash = "sha256:e261f1baed6291f434ffb1d5c6bd8051d1c2a26958072d38dfbec39b3dda7352"},
 ]
 
 [[package]]
@@ -3212,9 +3244,9 @@ testing = ["big-O", "jaraco.functools", "jaraco.itertools", "more-itertools", "p
 [extras]
 fastembed = ["fastembed"]
 hybrid = ["pinecone-text"]
-local = ["torch", "transformers"]
+local = ["llama-cpp-python", "torch", "transformers"]
 
 [metadata]
 lock-version = "2.0"
 python-versions = "^3.9"
-content-hash = "5b459c6820bcf5c2b73daf0ecfcbbac95019311c74d88634bd7188650e48b749"
+content-hash = "1de69e2e5050507790405e09d4cd79fe114b4200a56c87cc609a104366696989"
diff --git a/pyproject.toml b/pyproject.toml
index 45f105fd9ac63138f7b5d4d82b6300d1052f2ccb..07536a512b342f58d2a0dc77933cdaa3c0f321f6 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -25,11 +25,12 @@ pinecone-text = {version = "^0.7.1", optional = true}
 fastembed = {version = "^0.1.3", optional = true, python = "<3.12"}
 torch = {version = "^2.1.2", optional = true}
 transformers = {version = "^4.36.2", optional = true}
+llama-cpp-python = {version = "^0.2.28", optional = true}
 
 [tool.poetry.extras]
 hybrid = ["pinecone-text"]
 fastembed = ["fastembed"]
-local = ["torch", "transformers"]
+local = ["torch", "transformers", "llama-cpp-python"]
 
 [tool.poetry.group.dev.dependencies]
 ipykernel = "^6.25.0"
diff --git a/semantic_router/llms/__init__.py b/semantic_router/llms/__init__.py
index e5aedc85fd30cc0b576fc2170c1b7ca694bdf200..02b3fd5b2422e718fcdf9fd4b34e4ace7fb3d957 100644
--- a/semantic_router/llms/__init__.py
+++ b/semantic_router/llms/__init__.py
@@ -1,6 +1,7 @@
 from semantic_router.llms.base import BaseLLM
 from semantic_router.llms.cohere import CohereLLM
+from semantic_router.llms.llamacpp import LlamaCppLLM
 from semantic_router.llms.openai import OpenAILLM
 from semantic_router.llms.openrouter import OpenRouterLLM
 
-__all__ = ["BaseLLM", "OpenAILLM", "OpenRouterLLM", "CohereLLM"]
+__all__ = ["BaseLLM", "OpenAILLM", "OpenRouterLLM", "CohereLLM", "LlamaCppLLM"]
diff --git a/semantic_router/llms/base.py b/semantic_router/llms/base.py
index 2560261173e61bac8f7769b12028261e812b1327..4fd2c3893e1a257dc9b64f35d8b6f19f7314f087 100644
--- a/semantic_router/llms/base.py
+++ b/semantic_router/llms/base.py
@@ -1,8 +1,10 @@
-from typing import List, Optional
+import json
+from typing import Any, List, Optional
 
 from pydantic import BaseModel
 
 from semantic_router.schema import Message
+from semantic_router.utils.logger import logger
 
 
 class BaseLLM(BaseModel):
@@ -11,5 +13,75 @@ class BaseLLM(BaseModel):
     class Config:
         arbitrary_types_allowed = True
 
+    def __init__(self, name: str, **kwargs):
+        super().__init__(name=name, **kwargs)
+
     def __call__(self, messages: List[Message]) -> Optional[str]:
         raise NotImplementedError("Subclasses must implement this method")
+
+    def _is_valid_inputs(
+        self, inputs: dict[str, Any], function_schema: dict[str, Any]
+    ) -> bool:
+        """Validate the extracted inputs against the function schema"""
+        try:
+            # Extract parameter names and types from the signature string
+            signature = function_schema["signature"]
+            param_info = [param.strip() for param in signature[1:-1].split(",")]
+            param_names = [info.split(":")[0].strip() for info in param_info]
+            param_types = [
+                info.split(":")[1].strip().split("=")[0].strip() for info in param_info
+            ]
+
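+            # NOTE: only parameter presence is checked below; the parsed types are
+            # not yet validated against the provided values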
+            for name, type_str in zip(param_names, param_types):
+                if name not in inputs:
+                    logger.error(f"Input {name} missing from query")
+                    return False
+            return True
+        except Exception as e:
+            logger.error(f"Input validation error: {str(e)}")
+            return False
+
+    def extract_function_inputs(
+        self, query: str, function_schema: dict[str, Any]
+    ) -> dict:
+        logger.info("Extracting function input...")
+
+        prompt = f"""
+        You are a helpful assistant designed to output JSON.
+        Given the following function schema
+        << {function_schema} >>
+        and query
+        << {query} >>
+        extract the parameters values from the query, in a valid JSON format.
+        Example:
+        Input:
+        query: "How is the weather in Hawaii right now in International units?"
+        schema:
+        {{
+            "name": "get_weather",
+            "description": "Useful to get the weather in a specific location",
+            "signature": "(location: str, degree: str) -> str",
+            "output": "<class 'str'>",
+        }}
+
+        Result: {{
+            "location": "London",
+            "degree": "Celsius",
+        }}
+
+        Input:
+        query: {query}
+        schema: {function_schema}
+        Result:
+        """
+        llm_input = [Message(role="user", content=prompt)]
+        output = self(llm_input)
+        if not output:
+            raise Exception("No output generated for extract function input")
+
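+        # LLMs often emit single-quoted keys and trailing commas; normalise the
+        # raw completion into valid JSON before parsing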
+        output = output.replace("'", '"').strip().rstrip(",")
+
+        function_inputs = json.loads(output)
+        if not self._is_valid_inputs(function_inputs, function_schema):
+            raise ValueError("Invalid inputs")
+        return function_inputs
diff --git a/semantic_router/llms/grammars/json.gbnf b/semantic_router/llms/grammars/json.gbnf
new file mode 100644
index 0000000000000000000000000000000000000000..a9537cdf9fbe49c79967090eab759973c52f2136
--- /dev/null
+++ b/semantic_router/llms/grammars/json.gbnf
@@ -0,0 +1,25 @@
+root   ::= object
+value  ::= object | array | string | number | ("true" | "false" | "null") ws
+
+object ::=
+  "{" ws (
+            string ":" ws value
+    ("," ws string ":" ws value)*
+  )? "}" ws
+
+array  ::=
+  "[" ws (
+            value
+    ("," ws value)*
+  )? "]" ws
+
+string ::=
+  "\"" (
+    [^"\\] |
+    "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]) # escapes
+  )* "\"" ws
+
+number ::= ("-"? ([0-9] | [1-9] [0-9]*)) ("." [0-9]+)? ([eE] [-+]? [0-9]+)? ws
+
+# Optional space: by convention, applied in this grammar after literal chars when allowed
+ws ::= ([ \t\n] ws)?
diff --git a/semantic_router/llms/llamacpp.py b/semantic_router/llms/llamacpp.py
new file mode 100644
index 0000000000000000000000000000000000000000..2586d2e4253e485445c9c5e5bc1b3b81061c8279
--- /dev/null
+++ b/semantic_router/llms/llamacpp.py
@@ -0,0 +1,76 @@
+from contextlib import contextmanager
+from pathlib import Path
+from typing import Any, Optional
+
+from llama_cpp import Llama, LlamaGrammar
+
+from semantic_router.llms.base import BaseLLM
+from semantic_router.schema import Message
+from semantic_router.utils.logger import logger
+
+
+class LlamaCppLLM(BaseLLM):
+    llm: Llama
+    temperature: float
+    max_tokens: Optional[int] = 200
+    grammar: Optional[LlamaGrammar] = None
+
+    def __init__(
+        self,
+        llm: Llama,
+        name: str = "llama.cpp",
+        temperature: float = 0.2,
+        max_tokens: Optional[int] = 200,
+        grammar: Optional[LlamaGrammar] = None,
+    ):
+        super().__init__(
+            name=name,
+            llm=llm,
+            temperature=temperature,
+            max_tokens=max_tokens,
+            grammar=grammar,
+        )
+        self.llm = llm
+        self.temperature = temperature
+        self.max_tokens = max_tokens
+        self.grammar = grammar
+
+    def __call__(
+        self,
+        messages: list[Message],
+    ) -> str:
+        try:
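+            # convert Message objects to llama.cpp chat format; if a grammar is
+            # attached, generation is constrained to that grammar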
+            completion = self.llm.create_chat_completion(
+                messages=[m.to_llamacpp() for m in messages],
+                temperature=self.temperature,
+                max_tokens=self.max_tokens,
+                grammar=self.grammar,
+                stream=False,
+            )
+            assert isinstance(completion, dict)  # keep mypy happy
+            output = completion["choices"][0]["message"]["content"]
+
+            if not output:
+                raise Exception("No output generated")
+            return output
+        except Exception as e:
+            logger.error(f"LLM error: {e}")
+            raise
+
+    @contextmanager
+    def _grammar(self):
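+        # temporarily attach the bundled JSON GBNF grammar so generation is
+        # constrained to valid JSON, then detach it once extraction completes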
+        grammar_path = Path(__file__).parent.joinpath("grammars", "json.gbnf")
+        assert grammar_path.exists(), f"{grammar_path}\ndoes not exist"
+        try:
+            self.grammar = LlamaGrammar.from_file(grammar_path)
+            yield
+        finally:
+            self.grammar = None
+
+    def extract_function_inputs(
+        self, query: str, function_schema: dict[str, Any]
+    ) -> dict:
+        with self._grammar():
+            return super().extract_function_inputs(
+                query=query, function_schema=function_schema
+            )
diff --git a/semantic_router/route.py b/semantic_router/route.py
index 3934d64fb700c3b61606fd7929fcbe85e9ba56e5..b3f36b8beb63d36a57e3ccc29d2daa0ecf98352b 100644
--- a/semantic_router/route.py
+++ b/semantic_router/route.py
@@ -53,8 +53,8 @@ class Route(BaseModel):
                     "attribute is set."
                 )
             # if a function schema is provided we generate the inputs
-            extracted_inputs = function_call.extract_function_inputs(
-                query=query, llm=self.llm, function_schema=self.function_schema
+            extracted_inputs = self.llm.extract_function_inputs(
+                query=query, function_schema=self.function_schema
             )
             func_call = extracted_inputs
         else:
diff --git a/semantic_router/schema.py b/semantic_router/schema.py
index 7529750df20999e767b50b21517a297994ee75ca..e4825999f37136f1c99ef7e89c8e69dd633d3ce2 100644
--- a/semantic_router/schema.py
+++ b/semantic_router/schema.py
@@ -63,6 +63,9 @@ class Message(BaseModel):
     def to_cohere(self):
         return {"role": self.role, "message": self.content}
 
+    def to_llamacpp(self):
+        return {"role": self.role, "content": self.content}
+
 
 class Conversation(BaseModel):
     messages: List[Message]
diff --git a/semantic_router/utils/function_call.py b/semantic_router/utils/function_call.py
index fd009c40f1ca96ecd1c709f4af9b1bb4f13e68c9..4a317852945f3287843bb84af770b91f364aac98 100644
--- a/semantic_router/utils/function_call.py
+++ b/semantic_router/utils/function_call.py
@@ -1,5 +1,4 @@
 import inspect
-import json
 from typing import Any, Callable, Dict, List, Union
 
 from pydantic import BaseModel
@@ -41,76 +40,6 @@ def get_schema(item: Union[BaseModel, Callable]) -> Dict[str, Any]:
     return schema
 
 
-def extract_function_inputs(
-    query: str, llm: BaseLLM, function_schema: Dict[str, Any]
-) -> Dict[str, Any]:
-    logger.info("Extracting function input...")
-
-    prompt = f"""
-You are a helpful assistant designed to output JSON.
-Given the following function schema
-<< {function_schema} >>
-and query
-<< {query} >>
-extract the parameters values from the query, in a valid JSON format.
-Example:
-Input:
-query: "How is the weather in Hawaii right now in International units?"
-schema:
-{{
-    "name": "get_weather",
-    "description": "Useful to get the weather in a specific location",
-    "signature": "(location: str, degree: str) -> float",
-    "output": "<class 'float'>",
-}}
-
-Result:
-{{
-    "location": "Hawaii",
-    "degree": "Kelvin",
-}}
-
-Input:
-query: \"{query}\"
-schema:
-{json.dumps(function_schema, indent=4)}
-
-Result:
-"""
-    llm_input = [Message(role="user", content=prompt)]
-    output = llm(llm_input)
-    if not output:
-        raise Exception("No output generated for extract function input")
-
-    output = output.replace("'", '"').strip().rstrip(",")
-
-    function_inputs = json.loads(output)
-    if not is_valid_inputs(function_inputs, function_schema):
-        raise ValueError("Invalid inputs")
-    return function_inputs
-
-
-def is_valid_inputs(inputs: Dict[str, Any], function_schema: Dict[str, Any]) -> bool:
-    """Validate the extracted inputs against the function schema"""
-    try:
-        # Extract parameter names and types from the signature string
-        signature = function_schema["signature"]
-        param_info = [param.strip() for param in signature[1:-1].split(",")]
-        param_names = [info.split(":")[0].strip() for info in param_info]
-        param_types = [
-            info.split(":")[1].strip().split("=")[0].strip() for info in param_info
-        ]
-
-        for name, type_str in zip(param_names, param_types):
-            if name not in inputs:
-                logger.error(f"Input {name} missing from query")
-                return False
-        return True
-    except Exception as e:
-        logger.error(f"Input validation error: {str(e)}")
-        return False
-
-
 # TODO: Add route layer object to the input, solve circular import issue
 async def route_and_execute(
     query: str, llm: BaseLLM, functions: List[Callable], layer
diff --git a/tests/unit/llms/test_llm_base.py b/tests/unit/llms/test_llm_base.py
index df78d8f54d881374a945eb32bbd692eb79a626a8..2208928a107575e8e5bc5306fc92b534713365f8 100644
--- a/tests/unit/llms/test_llm_base.py
+++ b/tests/unit/llms/test_llm_base.py
@@ -14,3 +14,58 @@ class TestBaseLLM:
     def test_base_llm_call_method_not_implemented(self, base_llm):
         with pytest.raises(NotImplementedError):
             base_llm("test")
+
+    def test_base_llm_is_valid_inputs_valid_input_pass(self, base_llm):
+        test_schema = {
+            "name": "get_time",
+            "description": 'Finds the current time in a specific timezone.\n\n:param timezone: The timezone to find the current time in, should\n    be a valid timezone from the IANA Time Zone Database like\n    "America/New_York" or "Europe/London". Do NOT put the place\n    name itself like "rome", or "new york", you must provide\n    the IANA format.\n:type timezone: str\n:return: The current time in the specified timezone.',
+            "signature": "(timezone: str) -> str",
+            "output": "<class 'str'>",
+        }
+        test_inputs = {"timezone": "America/New_York"}
+
+        assert base_llm._is_valid_inputs(test_inputs, test_schema) is True
+
+    @pytest.mark.skip(reason="TODO: bug in is_valid_inputs")
+    def test_base_llm_is_valid_inputs_valid_input_fail(self, base_llm):
+        test_schema = {
+            "name": "get_time",
+            "description": 'Finds the current time in a specific timezone.\n\n:param timezone: The timezone to find the current time in, should\n    be a valid timezone from the IANA Time Zone Database like\n    "America/New_York" or "Europe/London". Do NOT put the place\n    name itself like "rome", or "new york", you must provide\n    the IANA format.\n:type timezone: str\n:return: The current time in the specified timezone.',
+            "signature": "(timezone: str) -> str",
+            "output": "<class 'str'>",
+        }
+        test_inputs = {"timezone": None}
+
+        assert base_llm._is_valid_inputs(test_inputs, test_schema) is False
+
+    def test_base_llm_is_valid_inputs_invalid_false(self, base_llm):
+        test_schema = {
+            "name": "get_time",
+            "description": 'Finds the current time in a specific timezone.\n\n:param timezone: The timezone to find the current time in, should\n    be a valid timezone from the IANA Time Zone Database like\n    "America/New_York" or "Europe/London". Do NOT put the place\n    name itself like "rome", or "new york", you must provide\n    the IANA format.\n:type timezone: str\n:return: The current time in the specified timezone.',
+        }
+        test_inputs = {"timezone": "America/New_York"}
+
+        assert base_llm._is_valid_inputs(test_inputs, test_schema) is False
+
+    def test_base_llm_extract_function_inputs(self, base_llm):
+        with pytest.raises(NotImplementedError):
+            test_schema = {
+                "name": "get_time",
+                "description": 'Finds the current time in a specific timezone.\n\n:param timezone: The timezone to find the current time in, should\n    be a valid timezone from the IANA Time Zone Database like\n    "America/New_York" or "Europe/London". Do NOT put the place\n    name itself like "rome", or "new york", you must provide\n    the IANA format.\n:type timezone: str\n:return: The current time in the specified timezone.',
+                "signature": "(timezone: str) -> str",
+                "output": "<class 'str'>",
+            }
+            test_query = "What time is it in America/New_York?"
+            base_llm.extract_function_inputs(
+                query=test_query, function_schema=test_schema
+            )
+
+    def test_base_llm_extract_function_inputs_no_output(self, base_llm, mocker):
+        with pytest.raises(Exception):
+            base_llm.output = mocker.Mock(return_value=None)
+            test_schema = {
+                "name": "get_time",
+                "description": 'Finds the current time in a specific timezone.\n\n:param timezone: The timezone to find the current time in, should\n    be a valid timezone from the IANA Time Zone Database like\n    "America/New_York" or "Europe/London". Do NOT put the place\n    name itself like "rome", or "new york", you must provide\n    the IANA format.\n:type timezone: str\n:return: The current time in the specified timezone.',
+                "signature": "(timezone: str) -> str",
+                "output": "<class 'str'>",
+            }
+            test_query = "What time is it in America/New_York?"
+            base_llm.extract_function_inputs(
+                query=test_query, function_schema=test_schema
+            )
diff --git a/tests/unit/llms/test_llm_llamacpp.py b/tests/unit/llms/test_llm_llamacpp.py
new file mode 100644
index 0000000000000000000000000000000000000000..5793c2d2f1b008ccc2a5fe3b183e4698f20dee9c
--- /dev/null
+++ b/tests/unit/llms/test_llm_llamacpp.py
@@ -0,0 +1,73 @@
+import pytest
+from llama_cpp import Llama
+
+from semantic_router.llms import LlamaCppLLM
+from semantic_router.schema import Message
+
+
+@pytest.fixture
+def llamacpp_llm(mocker):
+    mock_llama = mocker.patch("llama_cpp.Llama", spec=Llama)
+    llm = mock_llama.return_value
+    return LlamaCppLLM(llm=llm)
+
+
+class TestLlamaCppLLM:
+    def test_llamacpp_llm_init_success(self, llamacpp_llm):
+        assert llamacpp_llm.name == "llama.cpp"
+        assert llamacpp_llm.temperature == 0.2
+        assert llamacpp_llm.max_tokens == 200
+        assert llamacpp_llm.llm is not None
+
+    def test_llamacpp_llm_call_success(self, llamacpp_llm, mocker):
+        llamacpp_llm.llm.create_chat_completion = mocker.Mock(
+            return_value={"choices": [{"message": {"content": "test"}}]}
+        )
+
+        llm_input = [Message(role="user", content="test")]
+        output = llamacpp_llm(llm_input)
+        assert output == "test"
+
+    def test_llamacpp_llm_grammar(self, llamacpp_llm):
+        llamacpp_llm._grammar()
+
+    def test_llamacpp_extract_function_inputs(self, llamacpp_llm, mocker):
+        llamacpp_llm.llm.create_chat_completion = mocker.Mock(
+            return_value={
+                "choices": [
+                    {"message": {"content": "{'timezone': 'America/New_York'}"}}
+                ]
+            }
+        )
+        test_schema = {
+            "name": "get_time",
+            "description": 'Finds the current time in a specific timezone.\n\n:param timezone: The timezone to find the current time in, should\n    be a valid timezone from the IANA Time Zone Database like\n    "America/New_York" or "Europe/London". Do NOT put the place\n    name itself like "rome", or "new york", you must provide\n    the IANA format.\n:type timezone: str\n:return: The current time in the specified timezone.',
+            "signature": "(timezone: str) -> str",
+            "output": "<class 'str'>",
+        }
+        test_query = "What time is it in America/New_York?"
+
+        llamacpp_llm.extract_function_inputs(
+            query=test_query, function_schema=test_schema
+        )
+
+    def test_llamacpp_extract_function_inputs_invalid(self, llamacpp_llm, mocker):
+        with pytest.raises(ValueError):
+            llamacpp_llm.llm.create_chat_completion = mocker.Mock(
+                return_value={
+                    "choices": [
+                        {"message": {"content": "{'time': 'America/New_York'}"}}
+                    ]
+                }
+            )
+            test_schema = {
+                "name": "get_time",
+                "description": 'Finds the current time in a specific timezone.\n\n:param timezone: The timezone to find the current time in, should\n    be a valid timezone from the IANA Time Zone Database like\n    "America/New_York" or "Europe/London". Do NOT put the place\n    name itself like "rome", or "new york", you must provide\n    the IANA format.\n:type timezone: str\n:return: The current time in the specified timezone.',
+                "signature": "(timezone: str) -> str",
+                "output": "<class 'str'>",
+            }
+            test_query = "What time is it in America/New_York?"
+
+            llamacpp_llm.extract_function_inputs(
+                query=test_query, function_schema=test_schema
+            )