Commit 7f9e0505 authored by Siraj R Aizlewood

Code for Cumulative Average Vector Similarity Splitter

parent 1f3c080d
from typing import List
from semantic_router.splitters import BaseSplitter
import numpy as np
from semantic_router.utils import DocumentSplit


class CAVSimSplitter(BaseSplitter):
    """
    The CAVSimSplitter class is a document splitter that uses Cumulative Average Vectors (CAV) to decide where to split a sequence of documents based on their semantic similarity. "cav_sim" stands for "cumulative average vector similarity": the cumulative average of the embedding vectors in the current split is compared with the average embedding of the next one or two documents.

    For example, consider a sequence of documents [A, B, C, D, E, F]. The CAVSimSplitter works as follows:

    1. It starts with the first document (A) and calculates the cosine similarity between the embedding of A and the average embedding of the next two documents (B, C) if both exist, or the embedding of the next document (B) if only one exists.
        - Cosine Similarity: cos_sim(A, avg(B, C))
    2. It then moves to the next document (B), calculates the average embedding of the documents accumulated so far (A, B), and calculates the cosine similarity with the average embedding of the next two documents (C, D) if both exist, or the embedding of the next document (C) if only one exists.
        - Cosine Similarity: cos_sim(avg(A, B), avg(C, D))
    3. This process continues, with the average embedding recalculated each time for the accumulated documents and for the next one or two documents. For example, at document C:
        - Cosine Similarity: cos_sim(avg(A, B, C), avg(D, E))
    4. If the similarity score between the average embedding of the accumulated documents and the average embedding of the next one or two documents falls below the specified similarity threshold, a split is triggered. In our example, say the score falls below the threshold between the average of A, B, C and the average of D, E. The splitter then closes the current split as [A, B, C], and D becomes the first document of the next split.
    5. After a split occurs, the process restarts with the next document in the sequence. For example, after the split between C and D, the process restarts with D and calculates the cosine similarity between the embedding of D and the average embedding of the next two documents, if they exist.
        - Cosine Similarity: cos_sim(D, avg(E, F))
    6. Accumulation and averaging then start again from the left. On the right there is only one document left, F:
        - Cosine Similarity: cos_sim(avg(D, E), F)
    7. The process continues until all documents have been processed. Any documents remaining after the last triggered split form the final split.

    The result is a list of DocumentSplit objects, each representing a group of semantically similar documents.
    """

    def __init__(
        self,
        docs: List[str],
        name: str = "cav_similarity_splitter",
        similarity_threshold: float = 0.45,
    ):
        super().__init__(
            docs=docs,
            name=name,
            similarity_threshold=similarity_threshold,
        )

    def __call__(self):
        total_docs = len(self.docs)
        splits = []
        curr_split_start_idx = 0
        curr_split_num = 1

        # Start at index 0 so that the first document is compared on its own,
        # as described in step 1 of the class docstring.
        for idx in range(total_docs):
            # Average embedding of all documents accumulated in the current split.
            curr_split_docs_embeds = self.encoder(
                self.docs[curr_split_start_idx : idx + 1]
            )
            avg_embedding = np.mean(curr_split_docs_embeds, axis=0)

            # Compute the average embedding for the next two documents, if available.
            if idx + 3 <= total_docs:  # Two more documents follow idx.
                next_doc_embeds = self.encoder(self.docs[idx + 1 : idx + 3])
                next_avg_embed = np.mean(next_doc_embeds, axis=0)
            elif idx + 2 <= total_docs:  # Only one more document follows idx.
                next_avg_embed = self.encoder([self.docs[idx + 1]])[0]
            else:  # idx is the last document, so there is nothing to compare.
                next_avg_embed = None

            if next_avg_embed is not None:
                # Cosine similarity between the cumulative average and the
                # average of the upcoming documents.
                curr_sim_score = np.dot(avg_embedding, next_avg_embed) / (
                    np.linalg.norm(avg_embedding)
                    * np.linalg.norm(next_avg_embed)
                )

                if curr_sim_score < self.similarity_threshold:
                    # Close the current split and start a new one at idx + 1.
                    splits.append(
                        DocumentSplit(
                            docs=list(self.docs[curr_split_start_idx : idx + 1]),
                            is_triggered=True,
                            triggered_score=curr_sim_score,
                        )
                    )
                    curr_split_start_idx = idx + 1
                    curr_split_num += 1

        # Whatever remains after the last triggered split forms the final split.
        splits.append(DocumentSplit(docs=list(self.docs[curr_split_start_idx:])))
        return splits
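
The comparison scheme described in the class docstring can be tried without an encoder. The following is a minimal sketch, not the library's implementation: `cav_split_indices` and `cosine_similarity` are hypothetical helpers introduced for illustration, and the toy 2-D vectors stand in for real embeddings. The sketch returns groups of document indices rather than DocumentSplit objects.

```python
from typing import List, Sequence

import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def cav_split_indices(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.45
) -> List[List[int]]:
    """Group document indices using the cumulative-average comparison.

    Hypothetical helper: it takes precomputed embeddings instead of
    calling an encoder on the documents.
    """
    vectors = np.asarray(embeddings, dtype=float)
    total = len(vectors)
    groups: List[List[int]] = []
    start = 0  # index of the first document in the current group
    for idx in range(total):
        lookahead = vectors[idx + 1 : idx + 3]  # next one or two documents
        if len(lookahead) == 0:
            break  # last document reached, nothing left to compare against
        current_avg = vectors[start : idx + 1].mean(axis=0)
        next_avg = lookahead.mean(axis=0)
        if cosine_similarity(current_avg, next_avg) < threshold:
            groups.append(list(range(start, idx + 1)))
            start = idx + 1
    # Whatever remains after the last split forms the final group.
    groups.append(list(range(start, total)))
    return groups


# Toy 2-D "embeddings": three documents near the x-axis, three near the y-axis.
toy_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [1.0, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
    [0.1, 1.0],
]
print(cav_split_indices(toy_embeddings, threshold=0.45))
# → [[0, 1, 2], [3, 4, 5]]
```

In this toy run the lookahead window at the second document straddles the topic change (it averages one x-axis and one y-axis vector), yet the score stays above the threshold. The split fires one step later, when both lookahead documents belong to the new topic.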