Commit d8b89500, authored by Haotian Zhang, committed by GitHub

add colbert import (#11309)

cr
parent 4f89050c
%% Cell type:markdown id: tags:
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/node_postprocessor/ColbertRerank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
%% Cell type:markdown id: tags:
# Colbert Rerank
%% Cell type:markdown id: tags:
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
[ColBERT](https://github.com/stanford-futuredata/ColBERT) is a fast and accurate retrieval model that enables scalable BERT-based search over large text collections in tens of milliseconds.
This example shows how to use the ColBERTv2 model as a reranker.
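As background, ColBERT scores a query against a passage by "late interaction": each query token embedding is matched to its most similar passage token embedding, and those maxima are summed. The sketch below illustrates that scoring rule on toy 2-dimensional vectors in plain Python. The function names `dot` and `maxsim_score` are hypothetical, and the LlamaIndex postprocessor may normalize or aggregate its scores differently.
%% Cell type:code id: tags:
```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: for each query token, take the best-matching
    document token, then sum those maxima over all query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings (illustrative values, not real model output).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```
%% Cell type:markdown id: tags:
A passage that covers every query token scores higher than one that matches only some of them, which is why this kind of scoring is useful for reordering candidates that a single-vector retriever considers similar.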
%% Cell type:code id: tags:
``` python
!pip install llama-index
!pip install llama-index-core
!pip install --quiet transformers torch
!pip install llama-index-embeddings-openai
!pip install llama-index-llms-openai
!pip install llama-index-postprocessor-colbert-rerank
```
%% Cell type:code id: tags:
``` python
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
)
```
%% Cell type:markdown id: tags:
Download Data
%% Cell type:code id: tags:
``` python
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
```
%% Cell type:code id: tags:
``` python
import os

# placeholder value; replace with your own OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-"
```
%% Cell type:code id: tags:
``` python
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
# build index
index = VectorStoreIndex.from_documents(documents=documents)
```
%% Cell type:markdown id: tags:
#### Retrieve top 10 most relevant nodes, then filter with Colbert Rerank
%% Cell type:code id: tags:
``` python
from llama_index.postprocessor.colbert_rerank import ColbertRerank

# keep the 5 best nodes after reranking, and store each node's original
# retrieval score in its metadata
colbert_reranker = ColbertRerank(
    top_n=5,
    model="colbert-ir/colbertv2.0",
    tokenizer="colbert-ir/colbertv2.0",
    keep_retrieval_score=True,
)

# retrieve 10 candidates from the vector index, then rerank them
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[colbert_reranker],
)
response = query_engine.query(
    "What did Sam Altman do in this essay?",
)
```
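%% Cell type:markdown id: tags:
The cell above follows a two-stage pattern: a cheap retriever shortlists `similarity_top_k` candidates, and a more precise scorer reorders that shortlist and keeps `top_n`. A minimal sketch of the pattern in plain Python follows. The helper `retrieve_then_rerank` and the toy score tables are hypothetical and only illustrate the control flow, not how LlamaIndex implements it.
%% Cell type:code id: tags:
```python
from typing import Callable, List


def retrieve_then_rerank(
    candidates: List[str],
    retrieval_score: Callable[[str], float],
    rerank_score: Callable[[str], float],
    top_k: int = 10,
    top_n: int = 5,
) -> List[str]:
    """Shortlist top_k items by the cheap score, then keep the top_n
    of that shortlist according to the more precise score."""
    shortlist = sorted(candidates, key=retrieval_score, reverse=True)[:top_k]
    return sorted(shortlist, key=rerank_score, reverse=True)[:top_n]


# Toy scores (illustrative values only).
cheap = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
precise = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 1.0}

result = retrieve_then_rerank(
    list(cheap), cheap.get, precise.get, top_k=3, top_n=2
)
print(result)
```
%% Cell type:markdown id: tags:
In this toy run, item `d` has the best precise score but never reaches the reranker because the first stage drops it. The reranker can only reorder what the retriever returns, so `similarity_top_k` should be comfortably larger than `top_n`.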
%% Cell type:code id: tags:
``` python
for node in response.source_nodes:
    print(node.id_)
    print(node.node.get_content()[:120])
    print("reranking score: ", node.score)
    print("retrieval score: ", node.node.metadata["retrieval_score"])
    print("**********")
```
%% Output
50157136-f221-4468-83e1-44e289f44cd5
When I was dealing with some urgent problem during YC, there was about a 60% chance it had to do with HN, and a 40% chan
reranking score: 0.6470144987106323
retrieval score: 0.8309200279065135
**********
87f0d691-b631-4b21-8123-8f71d383046b
Now that I could write essays again, I wrote a bunch about topics I'd had stacked up. I kept writing essays through 2020
reranking score: 0.6377773284912109
retrieval score: 0.8053000783543145
**********
10234ad9-46b1-4be5-8034-92392ac242ed
It's not that unprestigious types of work are good per se. But when you find yourself drawn to some kind of work despite
reranking score: 0.6301894187927246
retrieval score: 0.7975032272825491
**********
bc269bc4-49c7-4804-8575-cd6db47d70b8
It was as weird as it sounds. I resumed all my old patterns, except now there were doors where there hadn't been. Now wh
reranking score: 0.6282549500465393
retrieval score: 0.8026253284729862
**********
ebd7e351-64fc-4627-8ddd-2681d1ac33f8
As Jessica and I were walking home from dinner on March 11, at the corner of Garden and Walker streets, these three thre
reranking score: 0.6245909929275513
retrieval score: 0.7965812262372882
**********
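%% Cell type:markdown id: tags:
Because `keep_retrieval_score=True` preserves the original similarity score, the two orderings can be compared directly. The sketch below does this with the scores printed in the output for the Sam Altman query. The node IDs are shortened to their first 8 characters for readability, and the variable names are hypothetical.
%% Cell type:code id: tags:
```python
# (short node id, reranking score, retrieval score), copied from the output
# of the Sam Altman query; ids are truncated to 8 characters.
rows = [
    ("50157136", 0.6470144987106323, 0.8309200279065135),
    ("87f0d691", 0.6377773284912109, 0.8053000783543145),
    ("10234ad9", 0.6301894187927246, 0.7975032272825491),
    ("bc269bc4", 0.6282549500465393, 0.8026253284729862),
    ("ebd7e351", 0.6245909929275513, 0.7965812262372882),
]

# Order the same five nodes by each score, best first.
by_rerank = [r[0] for r in sorted(rows, key=lambda r: r[1], reverse=True)]
by_retrieval = [r[0] for r in sorted(rows, key=lambda r: r[2], reverse=True)]

# Nodes whose position differs between the two orderings.
moved = [nid for nid in by_rerank if by_rerank.index(nid) != by_retrieval.index(nid)]

print("rerank order:   ", by_rerank)
print("retrieval order:", by_retrieval)
print("moved:          ", moved)
```
%% Cell type:markdown id: tags:
Among these five nodes, only `10234ad9` and `bc269bc4` change places. This comparison covers just the nodes that survived reranking. The other five retrieved candidates were dropped, so their retrieval scores are not visible here.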
%% Cell type:code id: tags:
``` python
print(response)
```
%% Output
Sam Altman became the second president of Y Combinator after Paul Graham decided to step back from running the organization.
%% Cell type:code id: tags:
``` python
response = query_engine.query(
    "Which schools did Paul attend?",
)
```
%% Cell type:code id: tags:
``` python
for node in response.source_nodes:
    print(node.id_)
    print(node.node.get_content()[:120])
    print("reranking score: ", node.score)
    print("retrieval score: ", node.node.metadata["retrieval_score"])
    print("**********")
```
%% Output
6942863e-dfc5-4a99-b642-967b99b71343
I didn't want to drop out of grad school, but how else was I going to get out? I remember when my friend Robert Morris g
reranking score: 0.6333063840866089
retrieval score: 0.7964996889742813
**********
477c5de0-8e05-494e-95cc-e221881fb5c1
What I Worked On
February 2021
Before college the two main things I worked on, outside of school, were writing and pro
reranking score: 0.5930159091949463
retrieval score: 0.7771872700578062
**********
0448df5c-7950-483d-bc63-15e9110da3bc
[15] We got 225 applications for the Summer Founders Program, and we were surprised to find that a lot of them were from
reranking score: 0.5160146951675415
retrieval score: 0.7782554326959897
**********
83af8efd-e992-4fd3-ada4-3c4c6f9971a1
Much to my surprise, the time I spent working on this stuff was not wasted after all. After we started Y Combinator, I w
reranking score: 0.5005874633789062
retrieval score: 0.7800375923908894
**********
bc269bc4-49c7-4804-8575-cd6db47d70b8
It was as weird as it sounds. I resumed all my old patterns, except now there were doors where there hadn't been. Now wh
reranking score: 0.4977223873138428
retrieval score: 0.782688582042514
**********
%% Cell type:code id: tags:
``` python
print(response)
```
%% Output
Paul attended Cornell University for his graduate studies and later applied to RISD (Rhode Island School of Design) in the US.