Skip to content
Snippets Groups Projects
user avatar
FaustLyu authored
0323a8fb
History
Code owners
Assign users and groups as approvers for specific file changes. Learn more.

Introduction

This folder is to conduct retrieval-based context compression on LongBench using 3 retrievers.

First, download the LongBench dataset from HuggingFace and save them in ../LongBench/, resulting in the folder structure:

LongBench/
      LongBench/
          data/
              Put raw LongBench data here.
              2wikimqa.jsonl
              ...
      retrieval/
          BM25/
          contriever/
              contriever/: github
                  mcontriever/: huggingface
          embedding/
          README.md: This file.

Usage

Install the requirements with pip: pip install -r requirements.txt

Retrieval

We take contriever method as an example.

  1. Clone contriever from https://github.com/facebookresearch/contriever
  2. Replace the files in contriever directory with contriever/passage_retrieval.py and contriever/generate_passage_embeddings.py
  3. Get mcontriever model from https://huggingface.co/facebook/mcontriever
  4. run mContriever.sh
  5. Each line within the JSONL file is expanded by adding a new item "retrieved", which represents the retrieval outcomes of the original context. These results are sorted according to the retriever's criteria.

Evaluation

We take ChatGLM2-6B-32k as an example. First run pred.py:

python pred.py --model chatglm2-6b-32k --data C200 --top_k 7 

Then evaluate via eval.py:

python eval.py --model chatglm2-6b-32k --data C200_7

Then the evaluation files are in result_chatglm2-6b-32k.