Commit c1beddfe authored by Simonas

feat: RollingWindowSplitter example

parent ab0578c6
%% Cell type:code id: tags:
``` python
text = '''
In a recent surge of social media discussions on Weibo, Chinese netizens have been engaging in conversations about the struggles and challenges of earning money. The online debate sparked a wave of opinions and perspectives on the relationship between hard work, high pay, and finding contentment. Among the tweets, several users pontificated that one should avoid earning "too much hard-earned money."
The tweets and discussions revolve around the idea that working too hard for one's income can have a detrimental effect on one's life, both physically and mentally. Some users advocate for finding opportunities that align with one's strengths and passions, rather than simply focusing on high-paying jobs that may require excessive hours and intense labor.
One Weibo user pontificates, "Don't earn that much hard-earned money," a sentiment echoed by others with tweets such as, "Why is it that when earning money, that process always has to be so tough?" This question is followed by a comparison between two types of people - those who are used to earning money the hard way and those who seem to effortlessly obtain wealth. While the former group is depicted as having been taught to suffer from a young age, the latter is shown as being able to focus solely on their natural talents and thriving in their niche advantageously.
Discussions on the platform draw attention to a variety of issues that those who earn money the hard way might face. For example, they are described as likely having to work overtime, forgo time off for illness or rest, and maintain an unyielding dedication to their occupation, which often results in a never-ending cycle of work without any perceived progression in their lives.
Another tweet that captures this sentiment reads, "Drowning in more work and poverty despite trying harder and harder," pointing to a sense of despair and dissatisfaction that comes with work that is both disproportionately demanding and inadequately rewarding. Critics also note how the pursuit of hard-earned money could potentially create physical and mental health risks due to the unrelenting pressure and stress that these jobs might impose.
Conversely, those in favor of earning money with less difficulty contend that it's crucial to harness one's strengths and passions to create opportunities that yield financial success without the need for excessive labor. The debate revolves around the concept that people should seek out ways to work smarter, not harder, especially if it means a healthier and more fulfilling lifestyle.
In fact, the notion of a "vicious cycle," often attributed to those chasing hard-earned money, is juxtaposed with an idealized image of someone operating in their zone of excellence. Confidently focused on their strengths, such individuals are depicted as enjoying a more relaxed and less stressful work environment, one in which they can thrive without the need for never-ending overtime or self-sacrifice.
Some tweets even extend this sentiment to the broader socio-economic context, observing how wealth is not merely derived from manual labor or high-paying positions requiring extraordinary work hours. The tweets emphasize the importance of cultivating an entrepreneurial spirit and a penchant for innovative thinking, especially in the modern digital age.
One user writes, "Too hard-earned money isn't worth it. Learn how to make money using your brain, not your body," while another suggests, "Love will flow towards those who are not lacking in love, and money will flow towards those who are not lacking in money!"
While some of the discussions take a somewhat passive-aggressive view, others acknowledge that financial security and comfort might not always be possible for everyone. In a more realistic tone, a user remarks, "If life were so easy that diligence led to wealth, then the world's richest person would be the best worker bee. But that's not the case." This acknowledgment underscores the complexities of the economy and the role that factors like luck, connections, and a rapidly evolving job market can play in financial success.
Some users are quick to criticize the notion that earning money the hard way should be avoided, with one tweet expressing, "The person who advises you to avoid hard-earned money is likely a scammer who profits off providing emotional value in exchange for exploitation." Others argue that while it's essential to find enjoyment and fulfillment in one's work, it's crucial not to shun or belittle those who choose to work in physically demanding or high-paying industries.
Overall, the Weibo discussions offer a fascinating insight into the complexities of the modern Chinese labor market and the work-life balance that people strive to achieve. As in many countries, striking the right balance between work and play is an ongoing challenge for many Chinese citizens. However, the conversations on Weibo signal an increasing awareness of the importance of finding meaningful, fulfilling, and financially rewarding work that doesn't necessitate excessive sacrifice or sufferance.
In the end, as one user succinctly puts it, "Make sure you're earning your money in a way that brings you joy and satisfaction. That's the only way to ensure that your life doesn't become a never-ending cycle of hard work without any tangible progress."
In this context, social media discussions focusing on the trials and tribulations of earning money serve not only as an outlet for venting frustrations but also as a means of promoting dialogue and shared understanding about the challenges faced by workers across all industries. These virtual conversations sparked by tweets and in-depth discussions likely resonate with a wide swath of Chinese citizens struggling to navigate the complexities of balancing a career that pays well with one that brings them joy, fulfillment, and a sense of purpose.
As the discussions on Weibo continue to evolve and unfold, it is evident that the discourse around work, money, and life satisfaction holds the potential to inspire meaningful change and shift societal attitudes towards a more holistic, balanced, and humane understanding of success and prosperity.
---
Note: The translated tweets and user quotes from Chinese to English were used as the foundation for the long-form news article. The author tried to maintain the integrity of the original content in the translation while adapting it to fit a journalistic format. No inaccuracies were introduced during translation, and the opinion-based nature of the original content was preserved while maintaining objectivity.
Heart count: 0/2
Note: The author did not include any Chinese characters in the final response.
Collapse
'''
```
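%% Cell type:markdown id: tags:

The `OpenAIEncoder` used in the next cell needs OpenAI credentials before it can embed sentences. A minimal setup sketch, assuming the encoder falls back to the standard `OPENAI_API_KEY` environment variable when no key is passed explicitly:

%% Cell type:code id: tags:

``` python
import os
from getpass import getpass

# Assumption: OpenAIEncoder reads OPENAI_API_KEY from the environment
# when no API key argument is provided.
os.environ["OPENAI_API_KEY"] = os.environ.get("OPENAI_API_KEY") or getpass(
    "Enter OpenAI API key: "
)
```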
%% Cell type:code id: tags:
``` python
from semantic_router.splitters import RollingWindowSplitter
from semantic_router.encoders import OpenAIEncoder
splitter = RollingWindowSplitter(
    encoder=OpenAIEncoder(),
    min_split_tokens=50,
    max_split_tokens=300,
    window_size=5,  # sentences
    plot_splits=True,
)
```
%% Cell type:code id: tags:
``` python
splits = splitter([text])
```
%% Output
2024-02-23 09:50:04 WARNING semantic_router.utils.logger Single document exceeds the maximum token limit of 300. Splitting to sentences before semantically splitting.
2024-02-23 09:50:05 INFO semantic_router.utils.logger Iteration 0: Trying threshold: 0.8881277932028191
2024-02-23 09:50:05 INFO semantic_router.utils.logger Iteration 0: Median tokens per split: 24.0
2024-02-23 09:50:05 INFO semantic_router.utils.logger Iteration 0: Adjusting high to 0.8781277932028191
2024-02-23 09:50:05 INFO semantic_router.utils.logger Iteration 1: Trying threshold: 0.8687934834140205
2024-02-23 09:50:05 INFO semantic_router.utils.logger Iteration 1: Median tokens per split: 34.5
2024-02-23 09:50:05 INFO semantic_router.utils.logger Iteration 1: Adjusting high to 0.8587934834140205
2024-02-23 09:50:05 INFO semantic_router.utils.logger Final optimal threshold: 0.8687934834140205
2024-02-23 09:50:05 INFO semantic_router.utils.logger Split finalized with 218 tokens due to threshold 0.8687934834140205.
2024-02-23 09:50:05 INFO semantic_router.utils.logger Split finalized with 262 tokens due to exceeding token limit of 300.
2024-02-23 09:50:05 INFO semantic_router.utils.logger Split finalized with 137 tokens due to threshold 0.8687934834140205.
2024-02-23 09:50:05 INFO semantic_router.utils.logger Split finalized with 249 tokens due to threshold 0.8687934834140205.
2024-02-23 09:50:05 INFO semantic_router.utils.logger Split finalized with 117 tokens due to threshold 0.8687934834140205.
2024-02-23 09:50:05 INFO semantic_router.utils.logger Split finalized with 171 tokens due to threshold 0.8687934834140205.
2024-02-23 09:50:05 INFO semantic_router.utils.logger Split finalized with 72 tokens due to threshold 0.8687934834140205.
2024-02-23 09:50:05 INFO semantic_router.utils.logger Final split added with 23 tokens due to remaining documents.
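%% Cell type:markdown id: tags:

The log above shows the threshold calibration: the splitter bisects between bounds derived from the similarity-score distribution and nudges them by `threshold_adjustment` until the median tokens per split lands in the configured range. The calibrated value is stored on the instance:

%% Cell type:code id: tags:

``` python
# Should match the "Final optimal threshold" reported in the log above.
print(splitter.calculated_threshold)
```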
%% Cell type:code id: tags:
``` python
from semantic_router.splitters.utils import plot_splits
plot_splits(splits)
```
%% Output
Split 1, tokens 218:
In a recent surge of social media discussions on Weibo, Chinese netizens have been engaging in conversations about the struggles and challenges of earning money. The online debate sparked a wave of opinions and perspectives on the relationship between hard work, high pay, and finding contentment. Among the tweets, several users pontificated that one should avoid earning "too much hard-earned money." The tweets and discussions revolve around the idea that working too hard for one's income can have a detrimental effect on one's life, both physically and mentally. Some users advocate for finding opportunities that align with one's strengths and passions, rather than simply focusing on high-paying jobs that may require excessive hours and intense labor. One Weibo user pontificates, "Don't earn that much hard-earned money," a sentiment echoed by others with tweets such as, "Why is it that when earning money, that process always has to be so tough?" This question is followed by a comparison between two types of people - those who are used to earning money the hard way and those who seem to effortlessly obtain wealth.
----------------------------------------------------------------------------------------
Split 2, tokens 262:
While the former group is depicted as having been taught to suffer from a young age, the latter is shown as being able to focus solely on their natural talents and thriving in their niche advantageously. Discussions on the platform draw attention to a variety of issues that those who earn money the hard way might face. For example, they are described as likely having to work overtime, forgo time off for illness or rest, and maintain an unyielding dedication to their occupation, which often results in a never-ending cycle of work without any perceived progression in their lives. Another tweet that captures this sentiment reads, "Drowning in more work and poverty despite trying harder and harder," pointing to a sense of despair and dissatisfaction that comes with work that is both disproportionately demanding and inadequately rewarding. Critics also note how the pursuit of hard-earned money could potentially create physical and mental health risks due to the unrelenting pressure and stress that these jobs might impose. Conversely, those in favor of earning money with less difficulty contend that it's crucial to harness one's strengths and passions to create opportunities that yield financial success without the need for excessive labor. The debate revolves around the concept that people should seek out ways to work smarter, not harder, especially if it means a healthier and more fulfilling lifestyle.
----------------------------------------------------------------------------------------
Split 3, tokens 137:
In fact, the notion of a "vicious cycle," often attributed to those chasing hard-earned money, is juxtaposed with an idealized image of someone operating in their zone of excellence. Confidently focused on their strengths, such individuals are depicted as enjoying a more relaxed and less stressful work environment, one in which they can thrive without the need for never-ending overtime or self-sacrifice. Some tweets even extend this sentiment to the broader socio-economic context, observing how wealth is not merely derived from manual labor or high-paying positions requiring extraordinary work hours. The tweets emphasize the importance of cultivating an entrepreneurial spirit and a penchant for innovative thinking, especially in the modern digital age.
----------------------------------------------------------------------------------------
Split 4, tokens 249:
One user writes, "Too hard-earned money isn't worth it. Learn how to make money using your brain, not your body," while another suggests, "Love will flow towards those who are not lacking in love, and money will flow towards those who are not lacking in money!" While some of the discussions take a somewhat passive-aggressive view, others acknowledge that financial security and comfort might not always be possible for everyone. In a more realistic tone, a user remarks, "If life were so easy that diligence led to wealth, then the world's richest person would be the best worker bee. But that's not the case." This acknowledgment underscores the complexities of the economy and the role that factors like luck, connections, and a rapidly evolving job market can play in financial success. Some users are quick to criticize the notion that earning money the hard way should be avoided, with one tweet expressing, "The person who advises you to avoid hard-earned money is likely a scammer who profits off providing emotional value in exchange for exploitation." Others argue that while it's essential to find enjoyment and fulfillment in one's work, it's crucial not to shun or belittle those who choose to work in physically demanding or high-paying industries.
----------------------------------------------------------------------------------------
Split 5, tokens 117:
Overall, the Weibo discussions offer a fascinating insight into the complexities of the modern Chinese labor market and the work-life balance that people strive to achieve. As in many countries, striking the right balance between work and play is an ongoing challenge for many Chinese citizens. However, the conversations on Weibo signal an increasing awareness of the importance of finding meaningful, fulfilling, and financially rewarding work that doesn't necessitate excessive sacrifice or sufferance. In the end, as one user succinctly puts it, "Make sure you're earning your money in a way that brings you joy and satisfaction.
----------------------------------------------------------------------------------------
Split 6, tokens 171:
That's the only way to ensure that your life doesn't become a never-ending cycle of hard work without any tangible progress." In this context, social media discussions focusing on the trials and tribulations of earning money serve not only as an outlet for venting frustrations but also as a means of promoting dialogue and shared understanding about the challenges faced by workers across all industries. These virtual conversations sparked by tweets and in-depth discussions likely resonate with a wide swath of Chinese citizens struggling to navigate the complexities of balancing a career that pays well with one that brings them joy, fulfillment, and a sense of purpose. As the discussions on Weibo continue to evolve and unfold, it is evident that the discourse around work, money, and life satisfaction holds the potential to inspire meaningful change and shift societal attitudes towards a more holistic, balanced, and humane understanding of success and prosperity.
----------------------------------------------------------------------------------------
Split 7, tokens 72:
--- Note: The translated tweets and user quotes from Chinese to English were used as the foundation for the long-form news article. The author tried to maintain the integrity of the original content in the translation while adapting it to fit a journalistic format. No inaccuracies were introduced during translation, and the opinion-based nature of the original content was preserved while maintaining objectivity.
----------------------------------------------------------------------------------------
Split 8, tokens 23:
Heart count: 0/2 Note: The author did not include any Chinese characters in the final response. Collapse
----------------------------------------------------------------------------------------
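%% Cell type:markdown id: tags:

Beyond plotting, the returned `DocumentSplit` objects can be consumed directly, for example to build chunks for indexing. A short sketch using the `content` property and `token_count` field that this commit adds to the schema:

%% Cell type:code id: tags:

``` python
# Each DocumentSplit joins its sentences via the `content` property.
chunks = [split.content for split in splits]

for i, split in enumerate(splits):
    print(
        f"chunk {i}: {split.token_count} tokens, "
        f"triggered={split.is_triggered}, score={split.triggered_score}"
    )
```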
%% Cell type:code id: tags:
``` python
```
pyproject.toml:

@@ -31,6 +31,10 @@ llama-cpp-python = {version = "^0.2.28", optional = true}
 black = "^23.12.1"
 colorama = "^0.4.6"
 pinecone-client = {version="^3.0.0", optional = true}
+update = "^0.0.1"
+tiktoken = "^0.6.0"
+regex = "^2023.12.25"

 [tool.poetry.extras]
 hybrid = ["pinecone-text"]
 fastembed = ["fastembed"]
@@ -48,6 +52,7 @@ mypy = "^1.7.1"
 types-pyyaml = "^6.0.12.12"
 types-requests = "^2.31.0"
 termcolor = "^2.4.0"
+matplot = "^0.1.9"

 [build-system]
 requires = ["poetry-core"]
semantic_router/schema.py:

@@ -77,6 +77,12 @@ class Message(BaseModel):
 class DocumentSplit(BaseModel):
-    docs: List[str]
+    docs: list[str]
     is_triggered: bool = False
-    triggered_score: Optional[float] = None
+    triggered_score: float | None = None
+    token_count: int | None = None
+    metadata: dict | None = None
+
+    @property
+    def content(self) -> str:
+        return " ".join(self.docs)
semantic_router/splitters/__init__.py (all lines added):

from semantic_router.splitters.base import BaseSplitter
from semantic_router.splitters.consecutive_sim import ConsecutiveSimSplitter
from semantic_router.splitters.cumulative_sim import CumulativeSimSplitter
from semantic_router.splitters.rolling_window import RollingWindowSplitter

__all__ = [
    "BaseSplitter",
    "ConsecutiveSimSplitter",
    "CumulativeSimSplitter",
    "RollingWindowSplitter",
]
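With this `__init__.py` in place, the import used at the top of the notebook resolves from the package itself:

``` python
# Equivalent to importing from semantic_router.splitters.rolling_window
from semantic_router.splitters import RollingWindowSplitter
```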
semantic_router/splitters/base.py:

 from itertools import cycle
-from typing import List
+from typing import List, Optional

 from pydantic.v1 import BaseModel
 from termcolor import colored
@@ -12,6 +12,8 @@ class BaseSplitter(BaseModel):
     name: str
     encoder: BaseEncoder
     score_threshold: float
+    min_split_tokens: Optional[int] = None
+    max_split_tokens: Optional[int] = None

     def __call__(self, docs: List[str]) -> List[DocumentSplit]:
         raise NotImplementedError("Subclasses must implement this method")
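`BaseSplitter` now exposes optional `min_split_tokens`/`max_split_tokens` fields alongside the required `name`, `encoder`, and `score_threshold`. A hypothetical subclass sketch showing the `__call__` contract (the `FixedBatchSplitter` name and its logic are invented for illustration):

``` python
from typing import List

from semantic_router.schema import DocumentSplit
from semantic_router.splitters.base import BaseSplitter


class FixedBatchSplitter(BaseSplitter):
    """Toy splitter: one DocumentSplit per `batch_size` consecutive documents."""

    batch_size: int = 3

    def __call__(self, docs: List[str]) -> List[DocumentSplit]:
        return [
            DocumentSplit(docs=docs[i : i + self.batch_size])
            for i in range(0, len(docs), self.batch_size)
        ]
```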
semantic_router/splitters/rolling_window.py:

-from typing import List
 import numpy as np
-from matplotlib import pyplot as plt
-from nltk.tokenize import word_tokenize
 from semantic_router.encoders.base import BaseEncoder
 from semantic_router.schema import DocumentSplit
-from semantic_router.splitters.base import BaseSplitter
+from semantic_router.splitters.utils import split_to_sentences, tiktoken_length
 from semantic_router.utils.logger import logger


-class RollingWindowSplitter(BaseSplitter):
-    """
-    A splitter that divides documents into segments based on semantic similarity
-    using a rolling window approach.
-    It adjusts the similarity threshold dynamically.
-    Splitting is based:
-    - On the similarity threshold
-    - On the maximum token limit for a split
-
-    Attributes:
-        encoder (Callable): A function to encode documents into semantic vectors.
-        score_threshold (float): Initial threshold for similarity scores to decide
-            splits.
-        window_size (int): Size of the rolling window to calculate document context.
-        plot_splits (bool): Whether to plot the similarity scores and splits for
-            visualization.
-        min_split_tokens (int): Minimum number of tokens for a valid document split.
-        max_split_tokens (int): Maximum number of tokens a split can contain.
-        split_tokens_tolerance (int): Tolerance in token count to still consider a split
-            valid.
-        threshold_step_size (float): Step size to adjust the similarity threshold during
-            optimization.
-    """
-
+class RollingWindowSplitter:
     def __init__(
         self,
         encoder: BaseEncoder,
-        score_threshold=0.3,
+        threshold_adjustment: float = 0.01,
         window_size=5,
-        plot_splits=False,
         min_split_tokens=100,
         max_split_tokens=300,
         split_tokens_tolerance=10,
-        threshold_step_size=0.01,
+        plot_splits=False,
     ):
+        self.calculated_threshold: float
         self.encoder = encoder
-        self.score_threshold = score_threshold
+        self.threshold_adjustment = threshold_adjustment
         self.window_size = window_size
         self.plot_splits = plot_splits
         self.min_split_tokens = min_split_tokens
         self.max_split_tokens = max_split_tokens
         self.split_tokens_tolerance = split_tokens_tolerance
-        self.threshold_step_size = threshold_step_size

     def encode_documents(self, docs: list[str]) -> np.ndarray:
-        return np.array(self.encoder(docs))
-
-    def find_optimal_threshold(self, docs: list[str], encoded_docs: np.ndarray):
-        logger.info(f"Number of documents for finding optimal threshold: {len(docs)}")
-        token_counts = [len(word_tokenize(doc)) for doc in docs]
-        low, high = 0, 1
-        while low <= high:
-            self.score_threshold = (low + high) / 2
-            similarity_scores = self.calculate_similarity_scores(encoded_docs)
-            split_indices = self.find_split_indices(similarity_scores)
-            average_tokens = np.mean(
-                [
-                    sum(token_counts[start:end])
-                    for start, end in zip(
-                        [0] + split_indices, split_indices + [len(token_counts)]
-                    )
-                ]
-            )
-            if (
-                self.min_split_tokens - self.split_tokens_tolerance
-                <= average_tokens
-                <= self.max_split_tokens + self.split_tokens_tolerance
-            ):
-                break
-            elif average_tokens < self.min_split_tokens:
-                high = self.score_threshold - self.threshold_step_size
-            else:
-                low = self.score_threshold + self.threshold_step_size
+        try:
+            embeddings = self.encoder(docs)
+            return np.array(embeddings)
+        except Exception as e:
+            logger.error(f"Error encoding documents {docs}: {e}")
+            raise

     def calculate_similarity_scores(self, encoded_docs: np.ndarray) -> list[float]:
         raw_similarities = []
@@ -97,11 +47,68 @@ class RollingWindowSplitter(BaseSplitter):
         return raw_similarities

     def find_split_indices(self, similarities: list[float]) -> list[int]:
-        return [
-            idx + 1
-            for idx, sim in enumerate(similarities)
-            if sim < self.score_threshold
-        ]
+        split_indices = []
+        for idx in range(1, len(similarities)):
+            if similarities[idx] < self.calculated_threshold:
+                split_indices.append(idx + 1)
+        return split_indices
+
+    def find_optimal_threshold(self, docs: list[str], encoded_docs: np.ndarray):
+        token_counts = [tiktoken_length(doc) for doc in docs]
+        cumulative_token_counts = np.cumsum([0] + token_counts)
+        similarity_scores = self.calculate_similarity_scores(encoded_docs)
+
+        # Analyze the distribution of similarity scores to set initial bounds
+        median_score = np.median(similarity_scores)
+        std_dev = np.std(similarity_scores)
+
+        # Set initial bounds based on median and standard deviation
+        low = max(0.0, float(median_score - std_dev))
+        high = min(1.0, float(median_score + std_dev))
+
+        iteration = 0
+        while low <= high:
+            self.calculated_threshold = (low + high) / 2
+            logger.info(
+                f"Iteration {iteration}: Trying threshold: {self.calculated_threshold}"
+            )
+            split_indices = self.find_split_indices(similarity_scores)
+
+            # Calculate the token counts for each split using the cumulative sums
+            split_token_counts = [
+                cumulative_token_counts[end] - cumulative_token_counts[start]
+                for start, end in zip(
+                    [0] + split_indices, split_indices + [len(token_counts)]
+                )
+            ]
+
+            # Calculate the median token count for the splits
+            median_tokens = np.median(split_token_counts)
+            logger.info(
+                f"Iteration {iteration}: Median tokens per split: {median_tokens}"
+            )
+            if (
+                self.min_split_tokens - self.split_tokens_tolerance
+                <= median_tokens
+                <= self.max_split_tokens + self.split_tokens_tolerance
+            ):
+                logger.info(
+                    f"Iteration {iteration}: "
+                    f"Optimal threshold {self.calculated_threshold} found "
+                    f"with median tokens ({median_tokens}) in target range "
+                    f" {self.min_split_tokens}-{self.max_split_tokens}."
+                )
+                break
+            elif median_tokens < self.min_split_tokens:
+                high = self.calculated_threshold - self.threshold_adjustment
+                logger.info(f"Iteration {iteration}: Adjusting high to {high}")
+            else:
+                low = self.calculated_threshold + self.threshold_adjustment
+                logger.info(f"Iteration {iteration}: Adjusting low to {low}")
+            iteration += 1
+
+        logger.info(f"Final optimal threshold: {self.calculated_threshold}")
+        return self.calculated_threshold

     def split_documents(
         self, docs: list[str], split_indices: list[int], similarities: list[float]
@@ -114,93 +121,98 @@ class RollingWindowSplitter(BaseSplitter):
         or when a split point is reached and the minimum token requirement is met,
         the current split is finalized and added to the list of splits.
         """
-        token_counts = [len(word_tokenize(doc)) for doc in docs]
-        splits: List[DocumentSplit] = []
-        current_split: List[str] = []
+        token_counts = [tiktoken_length(doc) for doc in docs]
+        splits, current_split = [], []
         current_tokens_count = 0

         for doc_idx, doc in enumerate(docs):
             doc_token_count = token_counts[doc_idx]
-            # Check if current document causes token count to exceed max limit
-            if (
-                current_tokens_count + doc_token_count > self.max_split_tokens
-                and current_tokens_count >= self.min_split_tokens
-            ):
-                splits.append(
-                    DocumentSplit(docs=current_split.copy(), is_triggered=True)
-                )
-                logger.info(
-                    f"Split finalized with {current_tokens_count} tokens due to "
-                    f"exceeding token limit of {self.max_split_tokens}."
-                )
-                current_split, current_tokens_count = [], 0
-
-            current_split.append(doc)
-            current_tokens_count += doc_token_count
-
             # Check if current index is a split point based on similarity
-            if doc_idx + 1 in split_indices or doc_idx == len(docs) - 1:
-                if current_tokens_count >= self.min_split_tokens:
-                    if doc_idx < len(similarities):
-                        triggered_score = similarities[doc_idx]
-                        splits.append(
-                            DocumentSplit(
-                                docs=current_split.copy(),
-                                is_triggered=True,
-                                triggered_score=triggered_score,
-                            )
-                        )
-                        logger.info(
-                            f"Split finalized with {current_tokens_count} tokens due to"
-                            f" similarity score {triggered_score:.2f}."
-                        )
-                    else:
-                        # This case handles the end of the document list
-                        # where there's no similarity score
-                        splits.append(
-                            DocumentSplit(docs=current_split.copy(), is_triggered=False)
-                        )
-                        logger.info(
-                            f"Split finalized with {current_tokens_count} tokens "
-                            "at the end of the document list."
-                        )
-                    current_split, current_tokens_count = [], 0
-
-        # Ensure any remaining documents are included in the final token count
+            if doc_idx + 1 in split_indices:
+                if current_tokens_count + doc_token_count >= self.min_split_tokens:
+                    # Include the current document before splitting
+                    # if it doesn't exceed the max limit
+                    current_split.append(doc)
+                    current_tokens_count += doc_token_count
+
+                    triggered_score = (
+                        similarities[doc_idx] if doc_idx < len(similarities) else None
+                    )
+                    splits.append(
+                        DocumentSplit(
+                            docs=current_split.copy(),
+                            is_triggered=True,
+                            triggered_score=triggered_score,
+                            token_count=current_tokens_count,
+                        )
+                    )
+                    logger.info(
+                        f"Split finalized with {current_tokens_count} tokens due to "
+                        f"threshold {self.calculated_threshold}."
+                    )
+                    current_split, current_tokens_count = [], 0
+                    continue  # Move to the next document after splitting
+
+            # Check if adding the current document exceeds the max token limit
+            if current_tokens_count + doc_token_count > self.max_split_tokens:
+                if current_tokens_count >= self.min_split_tokens:
+                    splits.append(
+                        DocumentSplit(
+                            docs=current_split.copy(),
+                            is_triggered=False,
+                            triggered_score=None,
+                            token_count=current_tokens_count,
+                        )
+                    )
+                    logger.info(
+                        f"Split finalized with {current_tokens_count} tokens due to "
+                        f"exceeding token limit of {self.max_split_tokens}."
+                    )
+                    current_split, current_tokens_count = [], 0
+
+            current_split.append(doc)
+            current_tokens_count += doc_token_count
+
+        # Handle the last split
         if current_split:
-            splits.append(DocumentSplit(docs=current_split.copy(), is_triggered=False))
+            splits.append(
+                DocumentSplit(
+                    docs=current_split.copy(),
+                    is_triggered=False,
+                    triggered_score=None,
+                    token_count=current_tokens_count,
+                )
+            )
             logger.info(
-                f"Final split added with {current_tokens_count} tokens "
-                "due to remaining documents."
+                f"Final split added with {current_tokens_count} "
+                "tokens due to remaining documents."
            )

-        # Validation
+        # Validation to ensure no tokens are lost during the split
         original_token_count = sum(token_counts)
         split_token_count = sum(
-            [len(word_tokenize(doc)) for split in splits for doc in split.docs]
+            [tiktoken_length(doc) for split in splits for doc in split.docs]
         )
+        logger.debug(
+            f"Original Token Count: {original_token_count}, "
+            f"Split Token Count: {split_token_count}"
+        )
         if original_token_count != split_token_count:
             logger.error(
                 f"Token count mismatch: {original_token_count} != {split_token_count}"
             )
-            for i, split in enumerate(splits):
-                split_token_count = sum([len(word_tokenize(doc)) for doc in split.docs])
-                logger.error(f"Split {i} Token Count: {split_token_count}")
             raise ValueError(
                 f"Token count mismatch: {original_token_count} != {split_token_count}"
             )

         return splits

+    # TODO: fix to plot split based on token count and final split
     def plot_similarity_scores(
         self, similarities: list[float], split_indices: list[int]
     ):
+        try:
+            from matplotlib import pyplot as plt
+        except ImportError:
+            logger.warning("Plotting is disabled. Please `pip install matplotlib`.")
+            return
         if not self.plot_splits:
             return
         plt.figure(figsize=(12, 6))
@@ -213,24 +225,97 @@ class RollingWindowSplitter(BaseSplitter):
             label="Split" if split_index == split_indices[0] else "",
         )
         plt.axhline(
-            y=self.score_threshold,
+            y=self.calculated_threshold,
             color="g",
             linestyle="-.",
             label="Threshold Similarity Score",
         )
+        # Annotating each similarity score
+        for i, score in enumerate(similarities):
+            plt.annotate(
+                f"{score:.2f}",  # Formatting to two decimal places
+                (i, score),
+                textcoords="offset points",
+                xytext=(0, 10),  # Positioning the text above the point
+                ha="center",
+            )  # Center-align the text
         plt.xlabel("Document Segment Index")
         plt.ylabel("Similarity Score")
-        plt.title(f"Threshold: {self.score_threshold}", loc="right", fontsize=10)
+        plt.title(
+            f"Threshold: {self.calculated_threshold} |"
+            " Window Size: {self.window_size}",
+            loc="right",
+            fontsize=10,
+        )
         plt.suptitle("Document Similarity Scores", fontsize=14)
         plt.legend()
         plt.show()

+    def plot_sentence_similarity_scores(
+        self, docs: list[str], threshold: float, window_size: int
+    ):
+        try:
+            from matplotlib import pyplot as plt
+        except ImportError:
+            logger.warning("Plotting is disabled. Please `pip install matplotlib`.")
+            return
+        """
+        Computes similarity scores between the average of the last
+        'window_size' sentences and the next one,
+        plots a graph of these similarity scores, and prints the first
+        sentence after a similarity score below
+        a specified threshold.
+        """
+        sentences = [sentence for doc in docs for sentence in split_to_sentences(doc)]
+        encoded_sentences = self.encode_documents(sentences)
+        similarity_scores = []
+
+        for i in range(window_size, len(encoded_sentences)):
+            window_avg_encoding = np.mean(
+                encoded_sentences[i - window_size : i], axis=0
+            )
+            sim_score = np.dot(window_avg_encoding, encoded_sentences[i]) / (
+                np.linalg.norm(window_avg_encoding)
+                * np.linalg.norm(encoded_sentences[i])
+                + 1e-10
+            )
+            similarity_scores.append(sim_score)
+
+        plt.figure(figsize=(10, 8))
+        plt.plot(similarity_scores, marker="o", linestyle="-", color="b")
+        plt.title("Sliding Window Sentence Similarity Scores")
+        plt.xlabel("Sentence Index")
+        plt.ylabel("Similarity Score")
+        plt.grid(True)
+        plt.axhline(y=threshold, color="r", linestyle="--", label="Threshold")
+        plt.show()
+
+        for i, score in enumerate(similarity_scores):
+            if score < threshold:
+                print(
+                    f"First sentence after similarity score "
+                    f"below {threshold}: {sentences[i + window_size]}"
+                )
+
     def __call__(self, docs: list[str]) -> list[DocumentSplit]:
+        if not docs:
+            raise ValueError("At least one document is required for splitting.")
+
+        if len(docs) == 1:
+            token_count = tiktoken_length(docs[0])
+            if token_count > self.max_split_tokens:
+                logger.warning(
+                    f"Single document exceeds the maximum token limit "
+                    f"of {self.max_split_tokens}. "
+                    "Splitting to sentences before semantically splitting."
+                )
+                docs = split_to_sentences(docs[0])
+
         encoded_docs = self.encode_documents(docs)
         self.find_optimal_threshold(docs, encoded_docs)
         similarities = self.calculate_similarity_scores(encoded_docs)
         split_indices = self.find_split_indices(similarities=similarities)
         splits = self.split_documents(docs, split_indices, similarities)
         self.plot_similarity_scores(similarities, split_indices)
         return splits
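The machinery above reduces to a simple idea: embed each sentence, compare it with the mean embedding of the preceding `window_size` sentences, and start a new split wherever the cosine similarity drops below the calibrated threshold. A standalone sketch of that scoring step (illustrative only, not the library API; the indexing details of the real `calculate_similarity_scores`/`find_split_indices` may differ):

``` python
import numpy as np


def rolling_window_scores(embeddings: np.ndarray, window_size: int = 5) -> list[float]:
    """Cosine similarity of each embedding vs. the mean of the preceding window."""
    scores = []
    for i in range(1, len(embeddings)):
        context = embeddings[max(0, i - window_size) : i].mean(axis=0)
        cos = float(
            np.dot(context, embeddings[i])
            / (np.linalg.norm(context) * np.linalg.norm(embeddings[i]) + 1e-10)
        )
        scores.append(cos)
    return scores


def split_after(scores: list[float], threshold: float) -> list[int]:
    # A low score at position i means sentence i + 1 diverges from its context,
    # so a new split starts there.
    return [i + 1 for i, s in enumerate(scores) if s < threshold]
```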
semantic_router/splitters/utils.py (all lines added):

from typing import List

import regex
import tiktoken
from colorama import Fore, Style

from semantic_router.schema import DocumentSplit


def split_to_sentences(text: str) -> list[str]:
    """
    Enhanced regex pattern to split a given text into sentences more accurately.

    The enhanced regex pattern includes handling for:
    - Direct speech and quotations.
    - Abbreviations, initials, and acronyms.
    - Decimal numbers and dates.
    - Ellipses and other punctuation marks used in informal text.
    - Removing control characters and format characters.

    Args:
        text (str): The text to split into sentences.

    Returns:
        list: A list of sentences extracted from the text.
    """
    regex_pattern = r"""
        # Negative lookbehind for word boundary, word char, dot, word char
        (?<!\b\w\.\w.)
        # Negative lookbehind for single uppercase initials like "A."
        (?<!\b[A-Z][a-z]\.)
        # Negative lookbehind for abbreviations like "U.S."
        (?<!\b[A-Z]\.)
        # Negative lookbehind for abbreviations with uppercase letters and dots
        (?<!\b\p{Lu}\.\p{Lu}.)
        # Negative lookbehind for numbers, to avoid splitting decimals
        (?<!\b\p{N}\.)
        # Positive lookbehind for punctuation followed by whitespace
        (?<=\.|\?|!|:|\.\.\.)\s+
        # Positive lookahead for uppercase letter or opening quote at word boundary
        (?="?(?=[A-Z])|"\b)
        # OR
        |
        # Splits after punctuation that follows closing punctuation, followed by
        # whitespace
        (?<=[\"\'\]\)\}][\.!?])\s+(?=[\"\'\(A-Z])
        # OR
        |
        # Splits after punctuation if not preceded by a period
        (?<=[^\.][\.!?])\s+(?=[A-Z])
        # OR
        |
        # Handles splitting after ellipses
        (?<=\.\.\.)\s+(?=[A-Z])
        # OR
        |
        # Matches and removes control characters and format characters
        [\p{Cc}\p{Cf}]+
    """

    sentences = regex.split(regex_pattern, text, flags=regex.VERBOSE)
    sentences = [sentence.strip() for sentence in sentences if sentence.strip()]
    return sentences


def tiktoken_length(text: str) -> int:
    tokenizer = tiktoken.get_encoding("cl100k_base")
    tokens = tokenizer.encode(text, disallowed_special=())
    return len(tokens)


def plot_splits(document_splits: List[DocumentSplit]) -> None:
    colors = [Fore.RED, Fore.GREEN, Fore.BLUE, Fore.MAGENTA]
    for i, split in enumerate(document_splits):
        color = colors[i % len(colors)]
        colored_content = f"{color}{split.content}{Style.RESET_ALL}"
        print(f"Split {i + 1}, tokens {split.token_count}:")
        print(colored_content)
        print("-" * 88)
        print("\n")
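A quick usage sketch for the helpers above, pairing the regex-based sentence splitter with the tiktoken token counter (the exact sentence boundaries and token counts depend on the regex and tokenizer behaviour):

``` python
from semantic_router.splitters.utils import split_to_sentences, tiktoken_length

sample = 'He said: "Prices rose 3.5 percent." The market reacted... Analysts disagreed!'
for sentence in split_to_sentences(sample):
    print(tiktoken_length(sentence), "tokens:", sentence)
```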