%% Cell type:markdown id: tags:
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/node_postprocessor/ColbertRerank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
%% Cell type:markdown id: tags:
# Colbert Rerank
%% Cell type:markdown id: tags:
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.

[ColBERT](https://github.com/stanford-futuredata/ColBERT) is a fast and accurate retrieval model that enables scalable BERT-based search over large text collections in tens of milliseconds. This example shows how to use the ColBERT-v2 model as a reranker.
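%% Cell type:markdown id: tags:
As a quick sketch of what the reranker below computes: ColBERT encodes the query and each document into per-token embeddings and scores a pair with the late-interaction "MaxSim" operator, where each query token is matched to its most similar document token and the matches are aggregated:

$$s(q, d) = \sum_{i \in |q|} \max_{j \in |d|} \; \mathbf{q}_i \cdot \mathbf{d}_j$$

(The reranker used in this notebook appears to aggregate with a mean over query tokens rather than a sum, which is why its scores fall roughly in $[0, 1]$; treat the formula as the idea, not the exact implementation.)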
%% Cell type:code id: tags:
``` python
!pip install llama-index
!pip install llama-index-core
!pip install --quiet transformers torch
!pip install llama-index-embeddings-openai
!pip install llama-index-llms-openai
!pip install llama-index-postprocessor-colbert-rerank
```
%% Cell type:code id: tags:
``` python
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
)
```
%% Cell type:markdown id: tags:
Download Data
%% Cell type:code id: tags:
``` python
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
```
%% Cell type:code id: tags:
``` python
import os

os.environ["OPENAI_API_KEY"] = "sk-"
```
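%% Cell type:markdown id: tags:
(Optional) Instead of hardcoding the key, you can read it from the environment or prompt for it; the cell below is just one way to do that.
%% Cell type:code id: tags:
``` python
# Prompt for the key only if it is not already set in the environment.
import os
from getpass import getpass

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")
```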
%% Cell type:code id: tags:
``` python
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

# build index
index = VectorStoreIndex.from_documents(documents=documents)
```
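%% Cell type:markdown id: tags:
The defaults above are fine for this example. If you want different chunking or a different embedding model, one option (not required for this notebook) is to set them globally before building the index; the model name and sizes below are illustrative.
%% Cell type:code id: tags:
``` python
# Optional: adjust chunking / embeddings globally, then rebuild the index.
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.chunk_size = 512
Settings.chunk_overlap = 64

index = VectorStoreIndex.from_documents(documents=documents)
```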
%% Cell type:markdown id: tags:
#### Retrieve top 10 most relevant nodes, then filter with Colbert Rerank
%% Cell type:code id: tags:
``` python
from llama_index.postprocessor.colbert_rerank import ColbertRerank

colbert_reranker = ColbertRerank(
    top_n=5,
    model="colbert-ir/colbertv2.0",
    tokenizer="colbert-ir/colbertv2.0",
    keep_retrieval_score=True,
)

query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[colbert_reranker],
)
response = query_engine.query(
    "What did Sam Altman do in this essay?",
)
```
%% Cell type:code id: tags:
``` python
for node in response.source_nodes:
    print(node.id_)
    print(node.node.get_content()[:120])
    print("reranking score: ", node.score)
    print("retrieval score: ", node.node.metadata["retrieval_score"])
    print("**********")
```
%% Output
50157136-f221-4468-83e1-44e289f44cd5
When I was dealing with some urgent problem during YC, there was about a 60% chance it had to do with HN, and a 40% chan
reranking score: 0.6470144987106323
retrieval score: 0.8309200279065135
**********
87f0d691-b631-4b21-8123-8f71d383046b
Now that I could write essays again, I wrote a bunch about topics I'd had stacked up. I kept writing essays through 2020
reranking score: 0.6377773284912109
retrieval score: 0.8053000783543145
**********
10234ad9-46b1-4be5-8034-92392ac242ed
It's not that unprestigious types of work are good per se. But when you find yourself drawn to some kind of work despite
reranking score: 0.6301894187927246
retrieval score: 0.7975032272825491
**********
bc269bc4-49c7-4804-8575-cd6db47d70b8
It was as weird as it sounds. I resumed all my old patterns, except now there were doors where there hadn't been. Now wh
reranking score: 0.6282549500465393
retrieval score: 0.8026253284729862
**********
ebd7e351-64fc-4627-8ddd-2681d1ac33f8
As Jessica and I were walking home from dinner on March 11, at the corner of Garden and Walker streets, these three thre
reranking score: 0.6245909929275513
retrieval score: 0.7965812262372882
**********
%% Cell type:code id: tags:
``` python
print(response)
```
%% Output
Sam Altman became the second president of Y Combinator after Paul Graham decided to step back from running the organization.
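%% Cell type:markdown id: tags:
The `reranking score` printed above is ColBERT's late-interaction similarity between the query and each node, while the `retrieval score` is the original embedding similarity from the vector index. If you're curious what the reranker does under the hood, the cell below is a rough, self-contained sketch of MaxSim scoring built directly on `transformers`; it mirrors the idea (per-token embeddings, max over document tokens, mean over query tokens) but is not the library's exact implementation.
%% Cell type:code id: tags:
``` python
# Hedged sketch of ColBERT-style MaxSim scoring (illustrative, not the library's code).
import torch
from transformers import AutoModel, AutoTokenizer

_tokenizer = AutoTokenizer.from_pretrained("colbert-ir/colbertv2.0")
_model = AutoModel.from_pretrained("colbert-ir/colbertv2.0")


def maxsim_score(query: str, document: str) -> float:
    q = _tokenizer(query, return_tensors="pt", truncation=True)
    d = _tokenizer(document, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # Per-token contextual embeddings for query and document.
        q_emb = _model(**q).last_hidden_state[0]
        d_emb = _model(**d).last_hidden_state[0]
    # Normalize so dot products become cosine similarities.
    q_emb = torch.nn.functional.normalize(q_emb, dim=-1)
    d_emb = torch.nn.functional.normalize(d_emb, dim=-1)
    # Match each query token to its best document token, then average the matches.
    sim = q_emb @ d_emb.T  # shape: (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.mean().item()


print(maxsim_score("What did Sam Altman do?", "Sam Altman became the second president of YC."))
```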
%% Cell type:code id: tags:
``` python
response = query_engine.query(
    "Which schools did Paul attend?",
)
```
%% Cell type:code id: tags:
``` python
for node in response.source_nodes:
    print(node.id_)
    print(node.node.get_content()[:120])
    print("reranking score: ", node.score)
    print("retrieval score: ", node.node.metadata["retrieval_score"])
    print("**********")
```
%% Output
6942863e-dfc5-4a99-b642-967b99b71343
I didn't want to drop out of grad school, but how else was I going to get out? I remember when my friend Robert Morris g
reranking score: 0.6333063840866089
retrieval score: 0.7964996889742813
**********
477c5de0-8e05-494e-95cc-e221881fb5c1
What I Worked On
February 2021
Before college the two main things I worked on, outside of school, were writing and pro
reranking score: 0.5930159091949463
retrieval score: 0.7771872700578062
**********
0448df5c-7950-483d-bc63-15e9110da3bc
[15] We got 225 applications for the Summer Founders Program, and we were surprised to find that a lot of them were from
reranking score: 0.5160146951675415
retrieval score: 0.7782554326959897
**********
83af8efd-e992-4fd3-ada4-3c4c6f9971a1
Much to my surprise, the time I spent working on this stuff was not wasted after all. After we started Y Combinator, I w
reranking score: 0.5005874633789062
retrieval score: 0.7800375923908894
**********
bc269bc4-49c7-4804-8575-cd6db47d70b8
It was as weird as it sounds. I resumed all my old patterns, except now there were doors where there hadn't been. Now wh
reranking score: 0.4977223873138428
retrieval score: 0.782688582042514
**********
%% Cell type:code id: tags:
``` python
print(response)
```
%% Output
Paul attended Cornell University for his graduate studies and later applied to RISD (Rhode Island School of Design) in the US.
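%% Cell type:markdown id: tags:
The reranker doesn't have to live inside a query engine. Since `ColbertRerank` is a node postprocessor, you can also apply it directly to nodes returned by a retriever; the sketch below assumes the standard `postprocess_nodes` interface and simply re-scores the top-10 retrieved nodes for a query.
%% Cell type:code id: tags:
``` python
# Hedged sketch: use the reranker standalone on retrieved nodes.
retriever = index.as_retriever(similarity_top_k=10)
retrieved_nodes = retriever.retrieve("Which schools did Paul attend?")

reranked_nodes = colbert_reranker.postprocess_nodes(
    retrieved_nodes,
    query_str="Which schools did Paul attend?",
)

for node in reranked_nodes:
    print(node.id_, node.score)
```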