Skip to content
Snippets Groups Projects
README.md 1.67 KiB
Newer Older
bys0318's avatar
bys0318 committed
## Introduction
This folder is to conduct retrieval-based context compression on LongBench using 3 retrievers.
- BM25
- [Contriever](https://github.com/facebookresearch/contriever)
- OpenAI Embedding ([text-embedding-ada-002](https://openai.com/blog/new-and-improved-embedding-model))

First, download the LongBench dataset from HuggingFace and save them in `../LongBench/`, resulting in the folder structure:
```
LongBench/
      LongBench/
          data/
              Put raw LongBench data here.
              2wikimqa.jsonl
              ...
      retrieval/
          BM25/
          contriever/
              contriever/: github
                  mcontriever/: huggingface
          embedding/
          README.md: This file.
```
## Usage

Install the requirements with pip: `pip install -r requirements.txt`

### Retrieval

We take contriever method as an example.
1. Clone contriever from https://github.com/facebookresearch/contriever
2. Replace the files in contriever directory with `contriever/passage_retrieval.py` and `contriever/generate_passage_embeddings.py`
3. Get mcontriever model from https://huggingface.co/facebook/mcontriever
4. run `mContriever.sh`
5. Each line within the JSONL file is expanded by adding a new item "retrieved", which represents the retrieval outcomes of the original context. These results are sorted according to the retriever's criteria.

### Evaluation

We take ChatGLM2-6B-32k as an example. First run [pred.py](pred.py):
```bash
python pred.py --model chatglm2-6b-32k --data C200 --top_k 7 
```
Then evaluate via [eval.py](eval.py):
```bash
python eval.py --model chatglm2-6b-32k --data C200_7
```

Then the evaluation files are in `result_chatglm2-6b-32k`.