Commit 6864016e authored by James Briggs

version 0.0.27 changes

parent 56125726
File moved
%% Cell type:markdown id: tags:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/aurelio-labs/semantic-router/blob/main/docs/07-ollama-local-execution.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/aurelio-labs/semantic-router/blob/main/docs/07-ollama-local-execution.ipynb)
%% Cell type:markdown id: tags:
# Local Dynamic Routes - With Ollama
%% Cell type:markdown id: tags:
## Fully local Semantic Router with Ollama and HuggingFace Encoder
There are many reasons users might choose to run their own LLMs rather than rely on a third-party service, including cost, privacy and compliance. Semantic Router supports "local" LLMs through `llama.cpp` and, as in this notebook, through Ollama.
Below is an example of Semantic Router using Ollama to run the **OpenHermes** LLM locally.
%% Cell type:markdown id: tags:
We need `pillow`, `torch` and `transformers` for the HuggingFace encoders.
%% Cell type:markdown id: tags:
## Installing the Library and Dependencies
%% Cell type:code id: tags:
``` python
!pip install semantic_router[local]==0.0.23 \
pillow torch transformers
!pip install -qU "semantic_router[local]==0.0.27"
```
%% Output
Requirement already satisfied: semantic_router[local]==0.0.23 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (0.0.23)
Requirement already satisfied: pillow in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (10.2.0)
Requirement already satisfied: torch in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (2.2.0)
Requirement already satisfied: transformers in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (4.38.0)
Requirement already satisfied: black<24.0.0,>=23.12.1 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from semantic_router[local]==0.0.23) (23.12.1)
Requirement already satisfied: cohere<5.0,>=4.32 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from semantic_router[local]==0.0.23) (4.47)
Requirement already satisfied: colorama<0.5.0,>=0.4.6 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from semantic_router[local]==0.0.23) (0.4.6)
Requirement already satisfied: colorlog<7.0.0,>=6.8.0 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from semantic_router[local]==0.0.23) (6.8.2)
Requirement already satisfied: llama-cpp-python<0.3.0,>=0.2.28 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from semantic_router[local]==0.0.23) (0.2.45)
Requirement already satisfied: mistralai<0.0.13,>=0.0.12 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from semantic_router[local]==0.0.23) (0.0.12)
Requirement already satisfied: numpy<2.0.0,>=1.25.2 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from semantic_router[local]==0.0.23) (1.26.4)
Requirement already satisfied: openai<2.0.0,>=1.10.0 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from semantic_router[local]==0.0.23) (1.12.0)
Requirement already satisfied: pydantic<3.0.0,>=2.5.3 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from semantic_router[local]==0.0.23) (2.6.1)
Requirement already satisfied: pyyaml<7.0.0,>=6.0.1 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from semantic_router[local]==0.0.23) (6.0.1)
Requirement already satisfied: filelock in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from torch) (3.13.1)
Requirement already satisfied: typing-extensions>=4.8.0 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from torch) (4.9.0)
Requirement already satisfied: sympy in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from torch) (1.12)
Requirement already satisfied: networkx in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from torch) (3.2.1)
Requirement already satisfied: jinja2 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from torch) (3.1.3)
Requirement already satisfied: fsspec in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from torch) (2024.2.0)
Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from transformers) (0.20.3)
Requirement already satisfied: packaging>=20.0 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from transformers) (23.2)
Requirement already satisfied: regex!=2019.12.17 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from transformers) (2023.12.25)
Requirement already satisfied: requests in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from transformers) (2.31.0)
Requirement already satisfied: tokenizers<0.19,>=0.14 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from transformers) (0.15.2)
Requirement already satisfied: safetensors>=0.4.1 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from transformers) (0.4.2)
Requirement already satisfied: tqdm>=4.27 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from transformers) (4.66.2)
Requirement already satisfied: click>=8.0.0 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from black<24.0.0,>=23.12.1->semantic_router[local]==0.0.23) (8.1.7)
Requirement already satisfied: mypy-extensions>=0.4.3 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from black<24.0.0,>=23.12.1->semantic_router[local]==0.0.23) (1.0.0)
Requirement already satisfied: pathspec>=0.9.0 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from black<24.0.0,>=23.12.1->semantic_router[local]==0.0.23) (0.12.1)
Requirement already satisfied: platformdirs>=2 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from black<24.0.0,>=23.12.1->semantic_router[local]==0.0.23) (4.2.0)
Requirement already satisfied: aiohttp<4.0,>=3.0 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from cohere<5.0,>=4.32->semantic_router[local]==0.0.23) (3.9.3)
Requirement already satisfied: backoff<3.0,>=2.0 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from cohere<5.0,>=4.32->semantic_router[local]==0.0.23) (2.2.1)
Requirement already satisfied: fastavro<2.0,>=1.8 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from cohere<5.0,>=4.32->semantic_router[local]==0.0.23) (1.9.4)
Requirement already satisfied: importlib_metadata<7.0,>=6.0 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from cohere<5.0,>=4.32->semantic_router[local]==0.0.23) (6.11.0)
Requirement already satisfied: urllib3<3,>=1.26 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from cohere<5.0,>=4.32->semantic_router[local]==0.0.23) (2.2.1)
Requirement already satisfied: diskcache>=5.6.1 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from llama-cpp-python<0.3.0,>=0.2.28->semantic_router[local]==0.0.23) (5.6.3)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from jinja2->torch) (2.1.5)
Requirement already satisfied: httpx<0.26.0,>=0.25.2 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from mistralai<0.0.13,>=0.0.12->semantic_router[local]==0.0.23) (0.25.2)
Requirement already satisfied: orjson<4.0.0,>=3.9.10 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from mistralai<0.0.13,>=0.0.12->semantic_router[local]==0.0.23) (3.9.14)
Requirement already satisfied: anyio<5,>=3.5.0 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from openai<2.0.0,>=1.10.0->semantic_router[local]==0.0.23) (4.2.0)
Requirement already satisfied: distro<2,>=1.7.0 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from openai<2.0.0,>=1.10.0->semantic_router[local]==0.0.23) (1.9.0)
Requirement already satisfied: sniffio in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from openai<2.0.0,>=1.10.0->semantic_router[local]==0.0.23) (1.3.0)
Requirement already satisfied: annotated-types>=0.4.0 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from pydantic<3.0.0,>=2.5.3->semantic_router[local]==0.0.23) (0.6.0)
Requirement already satisfied: pydantic-core==2.16.2 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from pydantic<3.0.0,>=2.5.3->semantic_router[local]==0.0.23) (2.16.2)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from requests->transformers) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from requests->transformers) (3.6)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from requests->transformers) (2024.2.2)
Requirement already satisfied: mpmath>=0.19 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from sympy->torch) (1.3.0)
Requirement already satisfied: aiosignal>=1.1.2 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from aiohttp<4.0,>=3.0->cohere<5.0,>=4.32->semantic_router[local]==0.0.23) (1.3.1)
Requirement already satisfied: attrs>=17.3.0 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from aiohttp<4.0,>=3.0->cohere<5.0,>=4.32->semantic_router[local]==0.0.23) (23.2.0)
Requirement already satisfied: frozenlist>=1.1.1 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from aiohttp<4.0,>=3.0->cohere<5.0,>=4.32->semantic_router[local]==0.0.23) (1.4.1)
Requirement already satisfied: multidict<7.0,>=4.5 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from aiohttp<4.0,>=3.0->cohere<5.0,>=4.32->semantic_router[local]==0.0.23) (6.0.5)
Requirement already satisfied: yarl<2.0,>=1.0 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from aiohttp<4.0,>=3.0->cohere<5.0,>=4.32->semantic_router[local]==0.0.23) (1.9.4)
Requirement already satisfied: httpcore==1.* in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from httpx<0.26.0,>=0.25.2->mistralai<0.0.13,>=0.0.12->semantic_router[local]==0.0.23) (1.0.3)
Requirement already satisfied: h11<0.15,>=0.13 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from httpcore==1.*->httpx<0.26.0,>=0.25.2->mistralai<0.0.13,>=0.0.12->semantic_router[local]==0.0.23) (0.14.0)
Requirement already satisfied: zipp>=0.5 in c:\users\siraj\documents\personal\work\aurelio\20240123 semantic router\venvs\semantic_router\lib\site-packages (from importlib_metadata<7.0,>=6.0->cohere<5.0,>=4.32->semantic_router[local]==0.0.23) (3.17.0)
[notice] A new release of pip is available: 23.1.2 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip
%% Cell type:code id: tags:
``` python
from semantic_router.encoders import HuggingFaceEncoder
encoder = HuggingFaceEncoder()
```
%% Output
c:\Users\Siraj\Documents\Personal\Work\Aurelio\20240123 Semantic Router\venvs\semantic_router\Lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
%% Cell type:markdown id: tags:
## Define Static Routes
%% Cell type:code id: tags:
``` python
from semantic_router import Route

# we could use this as a guide for our chatbot to avoid political conversations
politics = Route(
    name="politics",
    utterances=[
        "isn't politics the best thing ever",
        "why don't you tell me about your political opinions",
        "don't you just love the president",
        "don't you just hate the president",
        "they're going to destroy this country!",
        "they will save the country!",
    ],
)

# this could be used as an indicator to our chatbot to switch to a more
# conversational prompt
chitchat = Route(
    name="chitchat",
    utterances=[
        "how's the weather today?",
        "how are things going?",
        "lovely weather today",
        "the weather is horrendous",
        "let's go to the chippy",
    ],
)

# we place both of our routes together into a single list
routes = [politics, chitchat]
```
%% Cell type:markdown id: tags:
## Define Route Layer with Ollama
%% Cell type:code id: tags:
``` python
from semantic_router.layer import RouteLayer
from semantic_router.llms.ollama import OllamaLLM
llm = OllamaLLM(
    llm_name="openhermes"
)  # Change llm_name if you want to use a different LLM with dynamic routes.
rl = RouteLayer(encoder=encoder, routes=routes, llm=llm)
```
%% Output
2024-02-22 10:59:54 INFO semantic_router.utils.logger local
%% Cell type:markdown id: tags:
## Test Static Routes
%% Cell type:code id: tags:
``` python
rl("don't you love politics?").name
```
%% Output
'politics'
%% Cell type:code id: tags:
``` python
rl("how's the weather today?").name
```
%% Output
'chitchat'
%% Cell type:code id: tags:
``` python
rl("I'm interested in learning about llama 2").name
```
%% Cell type:markdown id: tags:
## Test Dynamic Routes
Dynamic routes work by associating a function with a route. If the input utterance is similar enough to the route's utterances that the semantic router chooses that route, a secondary process is triggered:
The LLM we specified in the `RouteLayer` (we specified Ollama, which isn't strictly an LLM, but which defaults to using the `OpenHermes` LLM) is then used to take a `function_schema` and the input utterance, and to extract values from the utterance that can serve as arguments for the `function` described by the `function_schema`. The returned values can then be passed to the `function` to obtain an output.
So, in short, it's a way of generating `function` inputs from an utterance, provided that utterance matches the route utterances closely enough.
In the example below, the utterance **"what is the time in new york city?"** triggers the "get_time" route, which has the `function_schema` of a likewise named `get_time()` function associated with it. Ollama then runs `OpenHermes` locally, which extracts the correctly formatted IANA timezone (`"America/New_York"`) from this utterance and from the information about the `function` that we provide in the `function_schema`. The returned string "America/New_York" can then be passed directly to the `get_time()` function to return the actual time in New York City.
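The last step of that process, turning the LLM's JSON output into a function call, can be sketched in isolation. This is only an illustration: `call_with_llm_output` is a hypothetical helper rather than part of Semantic Router, and the `get_time` here is a stand-in that returns a label instead of a clock time so the result is reproducible.

```python
import inspect
import json


def get_time(timezone: str) -> str:
    # Stand-in for the notebook's get_time(); returns a label, not a clock time.
    return f"time requested for {timezone}"


def call_with_llm_output(function, llm_output: str):
    """Parse the LLM's JSON output and call `function` with it (hypothetical helper)."""
    function_inputs = json.loads(llm_output)
    expected = set(inspect.signature(function).parameters)
    # Refuse to call the function if the LLM produced the wrong argument names.
    if set(function_inputs) != expected:
        raise ValueError(
            f"expected arguments {sorted(expected)}, got {sorted(function_inputs)}"
        )
    return function(**function_inputs)


result = call_with_llm_output(get_time, '{"timezone": "America/New_York"}')
print(result)
```

Checking the argument names before the call gives a clear error when the LLM returns something the function cannot accept.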
%% Cell type:markdown id: tags:
%% Cell type:code id: tags:
``` python
from datetime import datetime
from zoneinfo import ZoneInfo


def get_time(timezone: str) -> str:
    """
    Finds the current time in a specific timezone.

    :param timezone: The timezone to find the current time in, should
        be a valid timezone from the IANA Time Zone Database like
        "America/New_York" or "Europe/London". Do NOT put the place
        name itself like "rome", or "new york", you must provide
        the IANA format.
    :type timezone: str
    :return: The current time in the specified timezone.
    """
    now = datetime.now(ZoneInfo(timezone))
    return now.strftime("%H:%M")
```
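%% Cell type:markdown id: tags:
Because `get_time()` passes its argument straight to `ZoneInfo`, a place name such as "new york" raises an error instead of returning a time. One possible guard is sketched below; `is_valid_timezone` is a hypothetical helper, not something the notebook or Semantic Router defines.

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def is_valid_timezone(name: str) -> bool:
    """Return True if `name` resolves to an IANA timezone on this machine."""
    try:
        ZoneInfo(name)
    except (ZoneInfoNotFoundError, ValueError):
        # Unknown keys raise ZoneInfoNotFoundError; malformed keys raise ValueError.
        return False
    return True


# False: the space makes this an unknown key (the IANA name uses an underscore).
print(is_valid_timezone("America/New York"))
# Expected to be True wherever time zone data is installed.
print(is_valid_timezone("America/New_York"))
```

A check like this could run on the LLM's extracted value before calling `get_time()`.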
%% Cell type:code id: tags:
``` python
get_time("America/New_York")
```
%% Output
'01:59'
%% Cell type:code id: tags:
``` python
from semantic_router.utils.function_call import get_schema
schema = get_schema(get_time)
schema
```
%% Output
{'name': 'get_time',
'description': 'Finds the current time in a specific timezone.\n\n:param timezone: The timezone to find the current time in, should\n be a valid timezone from the IANA Time Zone Database like\n "America/New_York" or "Europe/London". Do NOT put the place\n name itself like "rome", or "new york", you must provide\n the IANA format.\n:type timezone: str\n:return: The current time in the specified timezone.\n ',
'signature': '(timezone: str) -> str',
'output': "<class 'str'>"}
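%% Cell type:markdown id: tags:
A schema with these four fields can be assembled from the standard library's `inspect` module. The sketch below is an assumption about how such a schema might be built, not Semantic Router's actual `get_schema` implementation, and `sketch_schema` is a hypothetical name. Note that `inspect.getdoc` normalises docstring indentation, so its description text differs slightly from the output above.

```python
import inspect


def sketch_schema(function) -> dict:
    """Collect name, docstring, signature and return type of `function`."""
    signature = inspect.signature(function)
    return {
        "name": function.__name__,
        "description": inspect.getdoc(function) or "",
        "signature": str(signature),
        "output": str(signature.return_annotation),
    }


def get_time(timezone: str) -> str:
    """Finds the current time in a specific timezone."""
    return "00:00"  # placeholder body; only the metadata matters here


schema_sketch = sketch_schema(get_time)
print(schema_sketch)
```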
%% Cell type:code id: tags:
``` python
time_route = Route(
    name="get_time",
    utterances=[
        "what is the time in new york city?",
        "what is the time in london?",
        "I live in Rome, what time is it?",
    ],
    function_schema=schema,
)
```
%% Cell type:code id: tags:
``` python
rl.add(time_route)
```
%% Output
2024-02-22 10:59:55 INFO semantic_router.utils.logger Adding `get_time` route
%% Cell type:code id: tags:
``` python
out = rl("what is the time in new york city?")
print(out)
```
%% Output
2024-02-22 11:01:29 INFO semantic_router.utils.logger Extracting function input...
2024-02-22 11:01:32 INFO semantic_router.utils.logger LLM output: {
"timezone": "America/New_York"
}
2024-02-22 11:01:32 INFO semantic_router.utils.logger Function inputs: {'timezone': 'America/New_York'}
name='get_time' function_call={'timezone': 'America/New_York'} similarity_score=None trigger=None
%% Cell type:code id: tags:
``` python
get_time(**out.function_call)
```
%% Output
'02:01'
%% Cell type:code id: tags:
``` python
```
......
%% Cell type:markdown id: tags:
### Partition elements using Unstructured library
%% Cell type:code id: tags:
``` python
# Installing these packages may take a while
!pip install -qU \
    "unstructured[pdf]==0.12.4" \
    "semantic-router==0.0.27"
```
%% Cell type:markdown id: tags:
Start by downloading and processing an ArXiv paper.
%% Cell type:code id: tags:
``` python
from unstructured.partition.auto import partition
article_url = "https://arxiv.org/pdf/2402.05131.pdf"
elements = partition(url=article_url, strategy="hi_res", pdf_infer_table_structure=True)
```
%% Output
/Users/jakit/customers/aurelio/semantic-router/.venv/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
Conflict between variables skip_infer_table_types: ['pdf', 'jpg', 'png', 'xls', 'xlsx', 'heic'] and pdf_infer_table_structure: True, please reset skip_infer_table_types to turn on table extraction for PDFs.
This function will be deprecated in a future release and `unstructured` will simply use the DEFAULT_MODEL from `unstructured_inference.model.base` to set default model name
Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
%% Cell type:markdown id: tags:
#### Define helper functions
%% Cell type:markdown id: tags:
Validate if parsed title element is a real title
%% Cell type:code id: tags:
``` python
import re


def is_valid_title(title: str) -> bool:
    # Rule 1: Title starts with a lowercase letter
    if re.match(r"^[a-z]", title):
        return False
    # Rule 2: Title has a special character (excluding :, -, and .)
    if re.search(r"[^\w\s:\-\.]", title):
        return False
    # Rule 3: Title ends with a dot
    if title.endswith("."):
        return False
    return True
```
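%% Cell type:markdown id: tags:
To see the three rules in action, the function can be run on a few strings taken from the parsed paper. The block repeats `is_valid_title` so that it runs on its own; the sample list is just an illustrative selection.

```python
import re


def is_valid_title(title: str) -> bool:
    # Rule 1: Title starts with a lowercase letter
    if re.match(r"^[a-z]", title):
        return False
    # Rule 2: Title has a special character (excluding :, -, and .)
    if re.search(r"[^\w\s:\-\.]", title):
        return False
    # Rule 3: Title ends with a dot
    if title.endswith("."):
        return False
    return True


samples = [
    "3 Methods",              # passes all three rules
    "v1",                     # rejected by rule 1 (lowercase start)
    "Question: {query}",      # rejected by rule 2 (braces)
    "2 Jimeno Yepes et al.",  # rejected by rule 3 (trailing dot)
]
results = {sample: is_valid_title(sample) for sample in samples}
for sample, valid in results.items():
    print(f"{sample}: {valid}")
```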
%% Cell type:markdown id: tags:
Group elements by valid titles
%% Cell type:code id: tags:
``` python
from unstructured.documents.elements import Element
from colorama import Fore, Style


def group_elements_by_title(elements: list[Element]) -> dict:
    grouped_elements = {}
    current_title = "Untitled"  # Default title for initial text without a title

    for element in elements:
        element_dict = element.to_dict()

        if element_dict.get("type") == "Title":
            potential_title = element_dict.get("text", "Untitled")
            if is_valid_title(potential_title):
                print(f"{Fore.GREEN}{potential_title}: True{Style.RESET_ALL}")
                current_title = potential_title
            else:
                print(f"{Fore.RED}{potential_title}: False{Style.RESET_ALL}")
                continue
        else:
            # Create the group on first use and always keep the element, so the
            # first element under each title is not silently dropped
            grouped_elements.setdefault(current_title, []).append(element)
    return grouped_elements
```
%% Cell type:markdown id: tags:
Generate chunks from the grouped elements using the semantic `RollingWindowSplitter`
%% Cell type:code id: tags:
``` python
from semantic_router.splitters import RollingWindowSplitter


def create_title_chunks(
    grouped_elements: dict, splitter: RollingWindowSplitter
) -> list:
    title_with_chunks = []
    for title, elements in grouped_elements.items():
        if not elements:
            continue
        combined_element_texts = []
        chunks = []

        for element in elements:
            if not element.text:
                continue
            element_dict = element.to_dict()
            if element_dict.get("type") == "Table":
                # Process accumulated text before the table
                if combined_element_texts:
                    splits = splitter(combined_element_texts)
                    print("-" * 80)
                    chunks.extend([split.content for split in splits])
                    combined_element_texts = []  # Reset combined texts after processing

                # Add table as a separate chunk
                table_text_html = element.metadata.text_as_html
                chunks.append(table_text_html)
            else:
                combined_element_texts.append(element.text)

        # Process any remaining accumulated text after the last table
        # or if no table was encountered
        if combined_element_texts:
            splits = splitter(combined_element_texts)
            print("-" * 80)
            chunks.extend([split.content for split in splits])

        if chunks:
            title_with_chunks.append({"title": title, "chunks": chunks})

    return title_with_chunks
```
%% Cell type:markdown id: tags:
Display chunked text in colors
%% Cell type:code id: tags:
``` python
from IPython.display import display, HTML
import itertools


def print_chunks_by_title(chunks_by_title):
    color_cycle = itertools.cycle(["red", "green", "blue", "magenta"])
    html_output = ""
    for section in chunks_by_title:
        title = section["title"]
        chunks = section["chunks"]
        html_output += f"<h3 style='color: black;'>{title}</h3>"
        for chunk in chunks:
            color = next(color_cycle)
            html_output += f"<p style='color: {color};'>{chunk}</p>"
    display(HTML(html_output))
```
%% Cell type:markdown id: tags:
### Process the elements
%% Cell type:code id: tags:
``` python
import os
from semantic_router.encoders import OpenAIEncoder

encoder = OpenAIEncoder(openai_api_key=os.environ["OPENAI_API_KEY"])

splitter = RollingWindowSplitter(
    encoder=encoder,
    window_size=1,  # Compares each element with the previous one
    min_split_tokens=1,
    max_split_tokens=500,
    plot_splits=False,
)
```
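%% Cell type:markdown id: tags:
With `window_size=1`, each element is compared with the one before it, and a split is made where the similarity drops. The sketch below illustrates that idea with hand-made two-dimensional vectors and a fixed threshold. It is not the `RollingWindowSplitter` implementation: the real splitter uses encoder embeddings, searches for a threshold itself, and enforces the token limits. The name `split_by_similarity` and the toy embeddings are assumptions made for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def split_by_similarity(docs, embeddings, threshold):
    """Start a new chunk whenever neighbouring docs are less similar than `threshold`."""
    if not docs:
        return []
    chunks = [[docs[0]]]
    for i in range(1, len(docs)):
        # Compare each document only with the one directly before it
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append([docs[i]])
        else:
            chunks[-1].append(docs[i])
    return chunks


docs = ["cats purr", "cats meow", "stocks fell"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy vectors, not real embeddings
chunks = split_by_similarity(docs, embeddings, threshold=0.5)
print(chunks)  # [['cats purr', 'cats meow'], ['stocks fell']]
```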
%% Cell type:code id: tags:
``` python
grouped_elements = group_elements_by_title(elements)
```
%% Output
et! ee: False
b e F 0 1: False
] L C . s c [: False
Financial Report Chunking for Effective Retrieval Augmented Generation: True
Introduction: True
2 Jimeno Yepes et al.: False
1 https://www.sec.gov 2 https://www.sec.gov/files/cf-frm.pdf: False
2 Related work: True
4 Jimeno Yepes et al.: False
3 Methods: True
3.1 RAG setting for the experiments: True
3.2 Indexing and retrieval: True
7 https://weaviate.io/developers/weaviate 8 https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-: False
v1: False
3.3 Generation: True
Question: {query}: False
3.4 Chunking: True
3.5 Dataset: True
4 Results: True
11 https://platform.openai.com/docs/guides/embeddings/limitations-risks: False
10 Jimeno Yepes et al.: False
5 Discussion: True
12 Jimeno Yepes et al.: False
6 Conclusions and Future Work: True
References: True
%% Cell type:code id: tags:
``` python
chunks_by_title = create_title_chunks(grouped_elements, splitter)
```
%% Output
/Users/jakit/customers/aurelio/semantic-router/.venv/lib/python3.9/site-packages/numpy/core/fromnumeric.py:3504: RuntimeWarning: Mean of empty slice.
return _methods._mean(a, axis=axis, dtype=dtype,
/Users/jakit/customers/aurelio/semantic-router/.venv/lib/python3.9/site-packages/numpy/core/_methods.py:129: RuntimeWarning: invalid value encountered in scalar divide
ret = ret.dtype.type(ret / rcount)
/Users/jakit/customers/aurelio/semantic-router/.venv/lib/python3.9/site-packages/numpy/core/_methods.py:206: RuntimeWarning: Degrees of freedom <= 0 for slice
ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/Users/jakit/customers/aurelio/semantic-router/.venv/lib/python3.9/site-packages/numpy/core/_methods.py:163: RuntimeWarning: invalid value encountered in divide
arrmean = um.true_divide(arrmean, div, out=arrmean,
/Users/jakit/customers/aurelio/semantic-router/.venv/lib/python3.9/site-packages/numpy/core/_methods.py:198: RuntimeWarning: invalid value encountered in scalar divide
ret = ret.dtype.type(ret / rcount)
2024-02-26 17:07:32 INFO semantic_router.utils.logger Optimal threshold 0.5 found with median tokens (27.0) in target range (1-500).
2024-02-26 17:07:32 INFO semantic_router.utils.logger Splitting Statistics:
- Total Documents: 1
- Total Splits: 1
- Splits by Threshold: 0
- Splits by Max Chunk Size: 0
- Last Split: 1
- Minimum Token Size of Split: 27
- Maximum Token Size of Split: 27
- Similarity Split Ratio: 0.00
--------------------------------------------------------------------------------
2024-02-26 17:07:33 INFO semantic_router.utils.logger Optimal threshold 0.7912974224053915 found with median tokens (136.5) in target range (1-500).
2024-02-26 17:07:33 INFO semantic_router.utils.logger Splitting Statistics:
- Total Documents: 3
- Total Splits: 2
- Splits by Threshold: 1
- Splits by Max Chunk Size: 0
- Last Split: 1
- Minimum Token Size of Split: 19
- Maximum Token Size of Split: 254
- Similarity Split Ratio: 0.50
--------------------------------------------------------------------------------
2024-02-26 17:07:33 INFO semantic_router.utils.logger Optimal threshold 0.8514465425347408 found with median tokens (129.5) in target range (1-500).
2024-02-26 17:07:33 INFO semantic_router.utils.logger Splitting Statistics:
- Total Documents: 7
- Total Splits: 4
- Splits by Threshold: 3
- Splits by Max Chunk Size: 0
- Last Split: 1
- Minimum Token Size of Split: 64
- Maximum Token Size of Split: 400
- Similarity Split Ratio: 0.75
--------------------------------------------------------------------------------
2024-02-26 17:07:34 INFO semantic_router.utils.logger Optimal threshold 0.8371609601655312 found with median tokens (154.0) in target range (1-500).
2024-02-26 17:07:34 INFO semantic_router.utils.logger Splitting Statistics:
- Total Documents: 7
- Total Splits: 4
- Splits by Threshold: 3
- Splits by Max Chunk Size: 0
- Last Split: 1
- Minimum Token Size of Split: 37
- Maximum Token Size of Split: 362
- Similarity Split Ratio: 0.75
--------------------------------------------------------------------------------
2024-02-26 17:07:34 INFO semantic_router.utils.logger Optimal threshold 0.8004127909380481 found with median tokens (46.0) in target range (1-500).
2024-02-26 17:07:34 INFO semantic_router.utils.logger Splitting Statistics:
- Total Documents: 5
- Total Splits: 3
- Splits by Threshold: 2
- Splits by Max Chunk Size: 0
- Last Split: 1
- Minimum Token Size of Split: 15
- Maximum Token Size of Split: 161
- Similarity Split Ratio: 0.67
--------------------------------------------------------------------------------
2024-02-26 17:07:35 INFO semantic_router.utils.logger Optimal threshold 0.7219220831968602 found with median tokens (94.0) in target range (1-500).
2024-02-26 17:07:35 INFO semantic_router.utils.logger Splitting Statistics:
- Total Documents: 6
- Total Splits: 3
- Splits by Threshold: 2
- Splits by Max Chunk Size: 0
- Last Split: 1
- Minimum Token Size of Split: 8
- Maximum Token Size of Split: 100
- Similarity Split Ratio: 0.67
--------------------------------------------------------------------------------
2024-02-26 17:07:35 INFO semantic_router.utils.logger Optimal threshold 0.7865543500746407 found with median tokens (92.5) in target range (1-500).
2024-02-26 17:07:35 INFO semantic_router.utils.logger Splitting Statistics:
- Total Documents: 4
- Total Splits: 2
- Splits by Threshold: 1
- Splits by Max Chunk Size: 0
- Last Split: 1
- Minimum Token Size of Split: 12
- Maximum Token Size of Split: 173
- Similarity Split Ratio: 0.50
--------------------------------------------------------------------------------
2024-02-26 17:07:36 INFO semantic_router.utils.logger Optimal threshold 0.7759885849518695 found with median tokens (73.0) in target range (1-500).
2024-02-26 17:07:36 INFO semantic_router.utils.logger Splitting Statistics:
- Total Documents: 9
- Total Splits: 5
- Splits by Threshold: 4
- Splits by Max Chunk Size: 0
- Last Split: 1
- Minimum Token Size of Split: 15
- Maximum Token Size of Split: 210
- Similarity Split Ratio: 0.80
--------------------------------------------------------------------------------
2024-02-26 17:07:36 INFO semantic_router.utils.logger Optimal threshold 0.7356350410401438 found with median tokens (23.0) in target range (1-500).
2024-02-26 17:07:36 INFO semantic_router.utils.logger Splitting Statistics:
- Total Documents: 5
- Total Splits: 3
- Splits by Threshold: 2
- Splits by Max Chunk Size: 0
- Last Split: 1
- Minimum Token Size of Split: 8
- Maximum Token Size of Split: 198
- Similarity Split Ratio: 0.67
--------------------------------------------------------------------------------
2024-02-26 17:07:37 INFO semantic_router.utils.logger Optimal threshold 0.7993056373716161 found with median tokens (14.0) in target range (1-500).
2024-02-26 17:07:37 INFO semantic_router.utils.logger Splitting Statistics:
- Total Documents: 5
- Total Splits: 3
- Splits by Threshold: 2
- Splits by Max Chunk Size: 0
- Last Split: 1
- Minimum Token Size of Split: 10
- Maximum Token Size of Split: 95
- Similarity Split Ratio: 0.67
--------------------------------------------------------------------------------
2024-02-26 17:07:37 INFO semantic_router.utils.logger Optimal threshold 0.7946781280578719 found with median tokens (104.5) in target range (1-500).
2024-02-26 17:07:37 INFO semantic_router.utils.logger Splitting Statistics:
- Total Documents: 4
- Total Splits: 2
- Splits by Threshold: 1
- Splits by Max Chunk Size: 0
- Last Split: 1
- Minimum Token Size of Split: 87
- Maximum Token Size of Split: 122
- Similarity Split Ratio: 0.50
--------------------------------------------------------------------------------
2024-02-26 17:07:38 INFO semantic_router.utils.logger Optimal threshold 0.7079124801171096 found with median tokens (15.0) in target range (1-500).
2024-02-26 17:07:38 INFO semantic_router.utils.logger Splitting Statistics:
- Total Documents: 2
- Total Splits: 1
- Splits by Threshold: 0
- Splits by Max Chunk Size: 0
- Last Split: 1
- Minimum Token Size of Split: 15
- Maximum Token Size of Split: 15
- Similarity Split Ratio: 0.00
--------------------------------------------------------------------------------
2024-02-26 17:07:38 INFO semantic_router.utils.logger Optimal threshold 0.8324466121743902 found with median tokens (110.5) in target range (1-500).
2024-02-26 17:07:38 INFO semantic_router.utils.logger Splitting Statistics:
- Total Documents: 12
- Total Splits: 6
- Splits by Threshold: 5
- Splits by Max Chunk Size: 0
- Last Split: 1
- Minimum Token Size of Split: 57
- Maximum Token Size of Split: 254
- Similarity Split Ratio: 0.83
--------------------------------------------------------------------------------
2024-02-26 17:07:39 INFO semantic_router.utils.logger Optimal threshold 0.8128022034342155 found with median tokens (16.5) in target range (1-500).
2024-02-26 17:07:39 INFO semantic_router.utils.logger Splitting Statistics:
- Total Documents: 3
- Total Splits: 2
- Splits by Threshold: 1
- Splits by Max Chunk Size: 0
- Last Split: 1
- Minimum Token Size of Split: 4
- Maximum Token Size of Split: 29
- Similarity Split Ratio: 0.50
--------------------------------------------------------------------------------
2024-02-26 17:07:39 INFO semantic_router.utils.logger Optimal threshold 0.786452236757286 found with median tokens (173.5) in target range (1-500).
2024-02-26 17:07:39 INFO semantic_router.utils.logger Splitting Statistics:
- Total Documents: 8
- Total Splits: 4
- Splits by Threshold: 3
- Splits by Max Chunk Size: 0
- Last Split: 1
- Minimum Token Size of Split: 8
- Maximum Token Size of Split: 241
- Similarity Split Ratio: 0.75
--------------------------------------------------------------------------------
2024-02-26 17:07:40 INFO semantic_router.utils.logger Optimal threshold 0.8250029487527775 found with median tokens (41.0) in target range (1-500).
2024-02-26 17:07:40 INFO semantic_router.utils.logger Splitting Statistics:
- Total Documents: 2
- Total Splits: 1
- Splits by Threshold: 0
- Splits by Max Chunk Size: 0
- Last Split: 1
- Minimum Token Size of Split: 41
- Maximum Token Size of Split: 41
- Similarity Split Ratio: 0.00
--------------------------------------------------------------------------------
2024-02-26 17:07:41 INFO semantic_router.utils.logger Optimal threshold 0.8086076732442027 found with median tokens (108.0) in target range (1-500).
2024-02-26 17:07:41 INFO semantic_router.utils.logger Splitting Statistics:
- Total Documents: 45
- Total Splits: 23
- Splits by Threshold: 22
- Splits by Max Chunk Size: 0
- Last Split: 1
- Minimum Token Size of Split: 4
- Maximum Token Size of Split: 513
- Similarity Split Ratio: 0.96
--------------------------------------------------------------------------------
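%% Cell type:markdown id: tags:
The statistics above appear to follow a simple pattern: in every logged block, `Total Splits` equals the sum of the three split counters, and `Similarity Split Ratio` equals `Splits by Threshold` divided by `Total Splits` (for example, 22 / 23 ≈ 0.96 in the last block). The sketch below recomputes both values from the logged counts. It is inferred from the numbers in this output only, not taken from the library's source, and `SplitStats` is a hypothetical helper, not part of `semantic_router`.
%% Cell type:code id: tags:
```python
from dataclasses import dataclass


@dataclass
class SplitStats:
    """Hypothetical container for the counters in one 'Splitting Statistics' block."""

    splits_by_threshold: int
    splits_by_max_chunk_size: int
    last_split: int

    @property
    def total_splits(self) -> int:
        # Assumption: the total is the sum of the three counters,
        # which matches every block in the log above.
        return (
            self.splits_by_threshold
            + self.splits_by_max_chunk_size
            + self.last_split
        )

    @property
    def similarity_split_ratio(self) -> float:
        # Assumption: share of splits triggered by the similarity threshold.
        if self.total_splits == 0:
            return 0.0
        return self.splits_by_threshold / self.total_splits


# Counters copied from the last block of the log (45 documents).
stats = SplitStats(splits_by_threshold=22, splits_by_max_chunk_size=0, last_split=1)
print(stats.total_splits)
print(f"{stats.similarity_split_ratio:.2f}")
```
A ratio near 1.0 would then mean that almost every split was triggered by the similarity threshold, while 0.00 (as in the two-document blocks) would mean the only split was the final one.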
%% Cell type:code id: tags:
``` python
print_chunks_by_title(chunks_by_title)
```
%% Output
......
[tool.poetry]
name = "semantic-router"
-version = "0.0.26"
+version = "0.0.27"
description = "Super fast semantic router for AI decision making"
authors = [
"James Briggs <james@aurelio.ai>",
......
......@@ -4,4 +4,4 @@ from semantic_router.route import Route
__all__ = ["RouteLayer", "HybridRouteLayer", "Route", "LayerConfig"]
-__version__ = "0.0.26"
+__version__ = "0.0.27"