## Introduction
This folder contains code for retrieval-based context compression on LongBench using three retrievers:
- BM25
- [Contriever](https://github.com/facebookresearch/contriever)
- OpenAI Embedding ([text-embedding-ada-002](https://openai.com/blog/new-and-improved-embedding-model))
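As a rough illustration of the first retriever, BM25 scores each context chunk against the question using term frequency and inverse document frequency. Below is a minimal pure-Python sketch of Okapi BM25 scoring; it is not this repo's implementation, and the `k1`/`b` values are conventional defaults rather than the settings used here:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document against the query."""
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs
    df = Counter()                      # document frequency per term
    for doc in docs_tokens:
        df.update(set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)               # term frequency within this document
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Context chunks would then be kept in descending order of their BM25 score against the question.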
First, download the LongBench dataset from HuggingFace and save it in `../LongBench/`, resulting in the following folder structure:
```
LongBench/
    LongBench/
        data/            # Put raw LongBench data here.
            2wikimqa.jsonl
            ...
    retrieval/
        BM25/
        contriever/
            contriever/  # cloned from GitHub
            mcontriever/ # downloaded from HuggingFace
        embedding/
        README.md        # This file.
```
## Usage
Install the requirements with pip: `pip install -r requirements.txt`
### Retrieval
We take the Contriever method as an example.
1. Clone contriever from https://github.com/facebookresearch/contriever
2. Replace the corresponding files in the contriever directory with `contriever/passage_retrieval.py` and `contriever/generate_passage_embeddings.py`
3. Download the mContriever model from https://huggingface.co/facebook/mcontriever
4. Run `mContriever.sh`
5. Each line of the resulting JSONL file gains a new field, `"retrieved"`, containing the retrieval results over the original context, sorted by the retriever's relevance scores.
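The augmentation step above can be sketched as follows. The field names `context` and `input` follow the LongBench schema; `retrieve` is a hypothetical callable standing in for any of the three retrievers, not a function from this repo:

```python
import json

def add_retrieval_results(in_path, out_path, retrieve):
    """Copy a LongBench JSONL file, adding a "retrieved" field per record.

    `retrieve(context, question)` is assumed to return the chunks of
    `context` sorted by relevance to `question`.
    """
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            record = json.loads(line)
            record["retrieved"] = retrieve(record["context"], record["input"])
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
```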
### Evaluation
We take ChatGLM2-6B-32k as an example. First run [pred.py](pred.py):
```bash
python pred.py --model chatglm2-6b-32k --data C200 --top_k 7
```
Then evaluate via [eval.py](eval.py):
```bash
python eval.py --model chatglm2-6b-32k --data C200_7
```
The evaluation results are then written to `result_chatglm2-6b-32k`.
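For reference, LongBench scores QA-style tasks with token-level F1 between the prediction and the ground truth. A simplified sketch assuming plain whitespace tokenization (the actual `eval.py` is more elaborate, with answer normalization and task-specific metrics):

```python
from collections import Counter

def qa_f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between prediction and ground truth (simplified)."""
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```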