Introduction
This folder implements retrieval-based context compression on LongBench using three retrievers:
- BM25
- Contriever
- OpenAI Embedding (text-embedding-ada-002)
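To make the idea of retrieval-based context compression concrete, here is a minimal sketch using BM25 scoring. It is not the code in this folder: the function names, whitespace tokenization, the `k1`/`b` defaults, and the chunk size are all assumptions chosen for illustration. The long context is split into fixed-size chunks, each chunk is scored against the query, and only the top-scoring chunks are kept.

```python
import math
from collections import Counter


def split_into_chunks(text, chunk_size):
    """Split text into chunks of at most chunk_size whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score every chunk against the query with the Okapi BM25 formula."""
    tokenized = [chunk.lower().split() for chunk in chunks]
    n = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n
    # Document frequency: number of chunks that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def compress_context(query, context, chunk_size=200, top_k=7):
    """Return the top_k chunks of context, best BM25 match first."""
    chunks = split_into_chunks(context, chunk_size)
    if not chunks:
        return []
    scores = bm25_scores(query, chunks)
    # sorted() is stable, so chunks with equal scores keep their original order.
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in ranked[:top_k]]
```

Contriever and the OpenAI embedding model would replace `bm25_scores` with a dense similarity between query and chunk embeddings; the chunk-then-rank structure stays the same.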
First, download the LongBench dataset from HuggingFace and save it in ../LongBench/, resulting in the following folder structure:
LongBench/
    LongBench/
        data/
            Put raw LongBench data here.
            2wikimqa.jsonl
            ...
    retrieval/
        BM25/
        contriever/
            contriever/: github
            mcontriever/: huggingface
        embedding/
        README.md: This file.
Usage
Install the requirements with pip: pip install -r requirements.txt
Retrieval
We take the Contriever method as an example.
- Clone contriever from https://github.com/facebookresearch/contriever
- Replace the files in the contriever directory with contriever/passage_retrieval.py and contriever/generate_passage_embeddings.py
- Get the mcontriever model from https://huggingface.co/facebook/mcontriever
- Run mContriever.sh
- Each line of the JSONL file is extended with a new item "retrieved", which holds the retrieval results for the original context, sorted according to the retriever's criteria.
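As an illustration of how the "retrieved" item could be consumed, the sketch below builds a compressed context from the top-ranked results. It assumes that "retrieved" is a list of text chunks with the best match first; the helper names and the sample record are hypothetical and are not taken from pred.py.

```python
import json


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def build_compressed_context(record, top_k):
    """Join the top_k highest-ranked retrieved chunks into one context string."""
    # Assumption: record["retrieved"] is a list of strings, best match first.
    return "\n".join(record["retrieved"][:top_k])


# Hypothetical record, shaped like one line of a retrieval output file.
line = json.dumps({
    "context": "chunk A chunk B chunk C",
    "retrieved": ["chunk B", "chunk A", "chunk C"],
})
record = json.loads(line)
compressed = build_compressed_context(record, top_k=2)
print(compressed)
```

A smaller `top_k` gives a shorter prompt at the cost of possibly dropping relevant chunks.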
Evaluation
We take ChatGLM2-6B-32k as an example. First run pred.py:
python pred.py --model chatglm2-6b-32k --data C200 --top_k 7
Then evaluate via eval.py:
python eval.py --model chatglm2-6b-32k --data C200_7
The evaluation files are then available in result_chatglm2-6b-32k.