cav_sim stands for "cumulative average vector similarity", as in this code the cumulative average of the embedding vectors are compared to the next embedding vector.
"""
"""
\ No newline at end of file
The CAVSimSplitter class is a document splitter that uses the concept of Cumulative Average Vectors (CAV) to determine where to split a sequence of documents based on their semantic similarity.
For example, consider a sequence of documents [A, B, C, D, E, F]. The CAVSimSplitter works as follows:
1. It starts with the first document (A) and calculates the cosine similarity between the embedding of A and the average embedding of the next two documents (B, C) if they exist, or the next one document (B) if only one exists.
- Cosine Similarity: cos_sim(A, avg(B, C))
2. It then moves to the next document (B), calculates the average embedding of the current documents (A, B), and calculates the cosine similarity with the average embedding of the next two documents (C, D) if they exist, or the next one document (C) if only one exists.
3. This process continues, with the average embedding being calculated for the current cumulative documents and the next one or two documents. For example, at document C:
4. If the similarity score between the average embedding of the current cumulative documents and the average embedding of the next one or two documents falls below the specified similarity threshold, a split is triggered. In our example, let's say the similarity score falls below the threshold between the average of documents A, B, C and the average of D, E. The splitter will then create a split, resulting in two groups of documents: [A, B, C] and [D, E].
5. After a split occurs, the process restarts with the next document in the sequence. For example, after the split between C and D, the process restarts with D and calculates the cosine similarity between the embedding of D and the average embedding of the next two documents if they exist.
- Cosine Similarity: cos_sim(D, avg(E, F))
6. Then we start accumulating and averaging from the left again. On the right there is only one more document left, F:
- Cosine Similarity: cos_sim(avg(D, E), F)
7. The process continues until all documents have been processed.
The result is a list of DocumentSplit objects, each representing a group of semantically similar documents.