LICENSE 0 → 100644
MIT License
Copyright (c) 2023 Yushi Bai
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md 0 → 100644
![](misc/logo.gif)
<p align="center">
🤗 <a href="https://huggingface.co/datasets/THUDM/LongBench" target="_blank">HF Repo</a> • 📃 Paper coming soon!
</p>
Read this in [English](README_EN.md).
# LongBench: 多语言、多任务长文本理解评测基准
**LongBench**是第一个多语言、多任务、全面评估大语言模型**长文本理解**能力的综合数据集。在目前大模型多语言能力引起广泛关注的背景下,LongBench涵盖了不同的语言(中文和英文),以此来对大模型在长文本下的多语言能力进行更全面的评估。同时,LongBench由二十个不同的任务组成,覆盖了单文档QA、多文档QA、摘要、Few-shot学习、代码补全和合成任务等关键的长文本应用场景。
我们深知模型评测过程中可能产生的高昂成本,尤其是长文本场景下(如人工标注成本或API调用成本)。因此,我们采用了一种全自动的评测方式,旨在以最低的成本,最有效地衡量和评估模型的长文本理解能力。
LongBench包含13个英文任务、5个中文任务和2个代码任务,多数任务的平均长度在5k-15k之间,共包含约4500条测试数据。从主要任务分类上,LongBench包含六大类任务,即单文档QA、多文档QA、摘要、Few-shot学习、代码补全和合成任务。关于LongBench数据集的具体统计及任务构造方式请参考[这里](task.md)
| 任务类型 | 英文任务数 | 中文任务数 | 代码任务数 |
| :----------: | :--------: | :--------: | :--------: |
| 单文档QA | 3 | 1 | - |
| 多文档QA | 3 | 1 | - |
| 摘要 | 2 | 1 | - |
| Few-shot学习 | 3 | 1 | - |
| 合成任务 | 2 | 1 | - |
| 代码补全 | - | - | 2 |
## 排行榜
我们在这里展示了所有模型在Zero-shot场景下,在中文和英文各大类任务上得分的平均值(%),各任务评估所用指标请参考[这里](task.md)
#### 英文榜单
| | Avg | 单文档QA | 多文档QA | 摘要 | Few-shot学习 | 代码补全 | 合成任务 |
| --- | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| GPT-3.5-Turbo-16k | 45.5 | 39.8 | 38.7 | 26.5 | 76.0 | 54.5 | 37.8 |
| Llama2-7B-chat-4k | 29.0 | 24.8 | 21.4 | 23.9 | 50.5 | 47.3 | 5.9 |
| LongChat-7B-16k | 33.7 | 29.3 | 16.1 | 25.8 | 59.9 | 57.0 | 14.2 |
| XGen-7B-8k | 28.7 | 24.5 | 20.4 | 24.8 | 58.7 | 38.0 | 5.6 |
| InternLM-7B-8k | 24.7 | 17.1 | 20.8 | 13.3 | 52.7 | 39.7 | 4.7 |
| ChatGLM2-6B | 26.0 | 23.1 | 15.0 | 22.9 | 46.1 | 46.1 | 2.7 |
| ChatGLM2-6B-32k | 42.7 | 32.8 | 34.0 | 28.6 | 68.1 | 52.7 | 39.8 |
#### 中文榜单
| | Avg | 单文档QA | 多文档QA | 摘要 | Few-shot学习 | 代码补全 | 合成任务 |
|-------|:---:|:-------------:|:------------:|:-------------:|:-----------------:|:---------------:|:----------------:|
| GPT-3.5-Turbo-16k | 44.5 | 61.2 | 28.7 | 16.0 | 29.2 | 54.5 | 77.5 |
| Llama2-7B-chat-4k | 13.5 | 11.6 | 1.9 | 0.2 | 19.8 | 47.3 | 0.5 |
| LongChat-7B-16k | 23.7 | 26.6 | 19.1 | 14.0 | 20.8 | 57.0 | 4.8 |
| XGen-7B-8k | 14.5 | 14.2 | 9.1 | 1.5 | 20.0 | 38.0 | 4.2 |
| InternLM-7B-8k | 18.6 | 33.3 | 8.9 | 13.0 | 15.5 | 39.7 | 0.9 |
| ChatGLM2-6B | 22.5 | 33.0 | 15.2 | 14.6 | 20.5 | 46.1 | 5.5 |
| ChatGLM2-6B-32k | 41.3 | 52.0 | 34.3 | 16.3 | 29.9 | 52.7 | 62.5 |
#### 长文本任务能力雷达图
![](misc/radar.png)
#### 不同长度文本下的能力变化
为了更有针对性地分析模型在不同文本长度下的相对表现,下图展示了模型在不同文本长度区间上,所有任务上的平均相对分数。
![](misc/curve.png)
> 注:假设模型在某个任务的特定长度范围内数据上得分为x,在该任务所有数据上得分为y,则模型在该长度范围的**相对分数**为(x/y-1)。为了更好比较不同模型的变化趋势,我们在0-4k将所有折线平移至0。
## 如何在LongBench上评测模型
#### 载入数据
你可以通过Hugging Face datasets来下载并载入**LongBench**的数据([🤗 HF Repo](https://huggingface.co/datasets/THUDM/LongBench)):
```python
from datasets import load_dataset
datasets = ["hotpotqa", "2wikimqa", "musique", "dureader", "narrativeqa", "qasper", "multifieldqa_en", \
"multifieldqa_zh", "gov_report", "qmsum", "vcsum", "trec", "nq", "triviaqa", "lsht", "passage_count", \
"passage_retrieval_en", "passage_retrieval_zh", "lcc", "repobench-p"]
for dataset in datasets:
    data = load_dataset('THUDM/LongBench', dataset, split='test')
```
同样地,你也可以直接用这个[链接](https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip)下载所有的评测数据。
#### 数据格式
**LongBench**中所有数据都统一为以下格式:
```json
{
"input": "任务的输入/指令,通常较短,比如QA中的问题、Few-shot任务中的提问等",
"context": "任务所需的长语境文本,比如文档、跨文件代码、Few-shot任务中的few-shot样本",
"answers": "由所有标准答案组成的列表",
"length": "前三项文本的总长度(中、英文分别用字、词数统计)",
"dataset": "本条数据所属数据集名称",
"language": "本条数据的语言",
"all_classes": "分类任务中的所有类别,非分类任务则为null",
"_id": "每条数据的随机id"
}
```
#### 评测
我们以ChatGLM2-6B为例提供了一份评测代码。首先,运行仓库下的[pred.py](pred.py)
```bash
CUDA_VISIBLE_DEVICES=0 python pred.py
```
可以在`pred/`文件夹下得到模型在所有数据集下的输出,此后运行[eval.py](eval.py)的评测代码:
```bash
python eval.py
```
可以在`result.json`中得到在各数据集上的评测结果。请注意,我们在`config/`下提供了我们总结出来的在各数据集上适合的输入格式和最大输出长度限制,在评测的时候可以进行修改以更好地适用你要评测的模型,修改后在[pred.py](pred.py)评测时会自动按照新的格式去整理数据并得到对应的模型输出。
## 详细评测结果
下面的几张表格展示了模型在所有子任务数据集上的Zero-shot评测结果(%),其中的中文数据集以“zh”标示(各任务评估所用指标请参考[这里](task.md))。
> 注:对于超出模型处理长度能力的文本,参考[Lost in the Middle](https://arxiv.org/abs/2307.03172)的观察,我们从文本中间进行截断,保持前后部分的信息。实验表明,这种截断方式对模型性能影响最小。
#### 单文档QA
| | NarrativeQA | Qasper | MultiFieldQA-en | MultiFieldQA-zh |
|-------------------|:-----------:|:------:|:---------------:|:---------------:|
| GPT-3.5-Turbo-16k | 23.6 | 43.3 | 52.3 | 61.2 |
| Llama2-7B-chat-4k | 19.1 | 19.6 | 35.8 | 11.6 |
| LongChat-7B-16k | 21.6 | 21.6 | 44.6 | 26.6 |
| XGen-7B-8k | 17.9 | 18.3 | 37.2 | 14.2 |
| InternLM-7B-8k | 12.4 | 16.8 | 22.3 | 33.3 |
| ChatGLM2-6B | 11.2 | 23.7 | 34.2 | 33.0 |
| ChatGLM2-6B-32k | 20.4 | 32.2 | 45.7 | 52.0 |
#### 多文档QA
| | HotpotQA | 2WikiMQA | Musique | DuReader (zh) |
|----------------------|:--------:|:--------:|:-------:|:--------:|
| GPT-3.5-Turbo-16k | 51.6 | 37.7 | 26.9 | 28.7 |
| Llama2-7B-chat-4k | 24.3 | 31.4 | 8.6 | 1.9 |
| LongChat-7B-16k | 22.4 | 16.8 | 9.1 | 19.1 |
| XGen-7B-8k | 28.3 | 21.5 | 11.5 | 9.1 |
| InternLM-7B-8k | 27.9 | 24.0 | 10.3 | 8.9 |
| ChatGLM2-6B | 20.2 | 19.6 | 5.3 | 15.2 |
| ChatGLM2-6B-32k | 44.9 | 34.9 | 22.2 | 34.3 |
#### 摘要
| | GovReport | QMSum | VCSUM (zh) |
|:-----------|:---------:|:-----:|:-----:|
| GPT-3.5-Turbo-16k | 29.5 | 23.4 | 16.0 |
| Llama2-7B-chat-4k | 27.3 | 20.6 | 0.2 |
| LongChat-7B-16k | 28.4 | 23.2 | 14.0 |
| XGen-7B-8k | 27.8 | 21.7 | 1.5 |
| InternLM-7B-8k | 9.8 | 16.8 | 13.0 |
| ChatGLM2-6B | 23.7 | 22.2 | 14.6 |
| ChatGLM2-6B-32k | 33.3 | 23.9 | 16.3 |
#### Few-shot学习
| | TREC | NQ | TriviaQA | LSHT (zh) |
| --- | :-: | :-: | :-: | :-: |
| GPT-3.5-Turbo-16k | 68.0 | 73.0 | 87.1 | 29.2 |
| Llama2-7B-chat-4k | 60.5 | 31.4 | 59.7 | 19.8 |
| LongChat-7B-16k | 61.5 | 44.8 | 73.5 | 20.8 |
| XGen-7B-8k | 66.0 | 43.2 | 67.0 | 20.0 |
| InternLM-7B-8k | 49.0 | 47.6 | 61.6 | 15.5 |
| ChatGLM2-6B | 44.0 | 34.5 | 59.8 | 20.5 |
| ChatGLM2-6B-32k | 62.0 | 64.9 | 77.6 | 29.9 |
#### 代码补全
| | LCC | RepoBench-P |
| --- | :-: | :-: |
| GPT-3.5-Turbo-16k | 54.7 | 54.3 |
| Llama2-7B-chat-4k | 52.3 | 42.4 |
| LongChat-7B-16k | 59.2 | 54.7 |
| XGen-7B-8k | 38.8 | 37.3 |
| InternLM-7B-8k | 45.5 | 34.0 |
| ChatGLM2-6B | 48.4 | 43.7 |
| ChatGLM2-6B-32k | 55.4 | 50.0 |
#### 合成任务
| | PassageRetrieval-en | Passage Count | PassageRetrieval-zh |
| --- | :-: | :-: | :-: |
| GPT-3.5-Turbo-16k | 71.0 | 4.5 | 77.5 |
| Llama2-7B-chat-4k | 9.2 | 2.5 | 0.5 |
| LongChat-7B-16k | 24.0 | 4.5 | 4.8 |
| XGen-7B-8k | 9.0 | 2.2 | 4.2 |
| InternLM-7B-8k | 6.5 | 2.9 | 0.9 |
| ChatGLM2-6B | 3.2 | 2.1 | 5.5 |
| ChatGLM2-6B-32k | 77.5 | 2.0 | 62.5 |
## 致谢
- **LongBench**的部分任务基于之前的研究者提出的数据集构建,包括[HotpotQA](https://hotpotqa.github.io/)、[2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/)、[Musique](https://arxiv.org/abs/2108.00573)、[DuReader](https://github.com/baidu/DuReader)、[NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf)、[Qasper](https://arxiv.org/pdf/2105.03011.pdf)、[GovReport](https://arxiv.org/pdf/2104.02112.pdf)、[QMSum](https://arxiv.org/pdf/2104.05938.pdf)、[VCSUM](https://arxiv.org/abs/2305.05280)、[TriviaQA](https://nlp.cs.washington.edu/triviaqa/)、[NQ](https://ai.google.com/research/NaturalQuestions/)、[TREC](https://aclanthology.org/C02-1150.pdf)、[LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf)、[LCC](https://arxiv.org/abs/2306.14893)和[RepoBench-P](https://arxiv.org/abs/2306.03091)。
## 引用
本工作由**THUKEG**和**Zhipu AI**共同完成,相关论文正在撰写中,届时将更新引用信息,敬请关注~
如果您使用该Benchmark,可以一并引用LongBench所基于的数据集对应的论文,相关引用信息列举在[这里](refs/ref.bib)。
README_EN.md 0 → 100644
![](misc/logo.gif)
<p align="center">
🤗 <a href="https://huggingface.co/datasets/THUDM/LongBench" target="_blank">HF Repo</a> • 📃 Paper coming soon!
</p>
Read this in [Chinese](README.md).
# LongBench: A Multilingual, Multitask Benchmark Tailored for Long Context Understanding
**LongBench** is the first comprehensive benchmark for multilingual, multi-task assessment of the **long context understanding** capabilities of large language models. With the multilingual capabilities of large models currently drawing wide attention, LongBench covers different languages (Chinese and English) to evaluate models' multilingual abilities on long contexts more comprehensively. In addition, LongBench consists of twenty different tasks, covering key long-text application scenarios such as single-document QA, multi-document QA, summarization, few-shot learning, code completion, and synthetic tasks.
We are fully aware of the potentially high costs of model evaluation, especially in long-text scenarios (such as manual annotation or API call costs). Therefore, we adopt a fully automated evaluation method, aimed at measuring long context understanding as effectively as possible at minimal cost.
LongBench includes 13 English tasks, 5 Chinese tasks, and 2 code tasks, with the average length of most tasks ranging from 5k to 15k, and about 4,500 test samples in total. In terms of major categories, LongBench covers six types of tasks: single-document QA, multi-document QA, summarization, few-shot learning, code completion, and synthetic tasks. For detailed statistics and construction methods of the LongBench tasks, please refer to [this page](task_en.md).
| Task Type | \#English Task | \#Chinese Task | \#Code Task |
| :-------: | :--------------------: | :--------------------: | :------------------: |
| Multi-document QA | 3 | 1 | - |
| Single-document QA | 3 | 1 | - |
| Summarization | 2 | 1 | - |
| Few-shot learning | 3 | 1 | - |
| Synthetic Tasks | 2 | 1 | - |
| Code Completion | - | - | 2 |
## Leaderboard
Here is the average score (%) of all models across the major task categories, in both Chinese and English, under the zero-shot setting. Please refer to this [link](task_en.md) for the evaluation metrics used for each task.
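For reference, below is a minimal sketch (not part of the repository) of turning the per-dataset scores that `eval.py` writes to `result.json` into per-category averages; the dataset-to-category mapping follows the task table in [task_en.md](task_en.md). Unlike the leaderboards below, this simple version averages the English and Chinese datasets of a category together.

```python
import json

# Dataset-to-category mapping, following the task table in task_en.md.
CATEGORIES = {
    "Single-Doc QA": ["narrativeqa", "qasper", "multifieldqa_en", "multifieldqa_zh"],
    "Multi-Doc QA": ["hotpotqa", "2wikimqa", "musique", "dureader"],
    "Summarization": ["gov_report", "qmsum", "vcsum"],
    "Few-shot Learning": ["trec", "nq", "triviaqa", "lsht"],
    "Code Completion": ["lcc", "repobench-p"],
    "Synthetic Tasks": ["passage_retrieval_en", "passage_count", "passage_retrieval_zh"],
}

def category_averages(result_path="result.json"):
    """Average the per-dataset scores produced by eval.py within each task category."""
    with open(result_path) as f:
        scores = json.load(f)
    averages = {}
    for category, names in CATEGORIES.items():
        present = [scores[name] for name in names if name in scores]
        if present:  # skip categories with no evaluated datasets
            averages[category] = round(sum(present) / len(present), 1)
    return averages

if __name__ == "__main__":
    print(category_averages())
```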
#### English
| | Avg | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Tasks |
| ----------------- | :--: | :-----------: | :----------: | :-----------: | :---------------: | :-------------: | :-------------: |
| GPT-3.5-Turbo-16k | 45.5 | 39.8 | 38.7 | 26.5 | 76.0 | 54.5 | 37.8 |
| Llama2-7B-chat-4k | 29.0 | 24.8 | 21.4 | 23.9 | 50.5 | 47.3 | 5.9 |
| LongChat-7B-16k | 33.7 | 29.3 | 16.1 | 25.8 | 59.9 | 57.0 | 14.2 |
| XGen-7B-8k | 28.7 | 24.5 | 20.4 | 24.8 | 58.7 | 38.0 | 5.6 |
| InternLM-7B-8k | 24.7 | 17.1 | 20.8 | 13.3 | 52.7 | 39.7 | 4.7 |
| ChatGLM2-6B | 26.0 | 23.1 | 15.0 | 22.9 | 46.1 | 46.1 | 2.7 |
| ChatGLM2-6B-32k | 42.7 | 32.8 | 34.0 | 28.6 | 68.1 | 52.7 | 39.8 |
#### Chinese
| | Avg | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Tasks |
| ----------------- | :--: | :-----------: | :----------: | :-----------: | :---------------: | :-------------: | :-------------: |
| GPT-3.5-Turbo-16k | 44.5 | 61.2 | 28.7 | 16.0 | 29.2 | 54.5 | 77.5 |
| Llama2-7B-chat-4k | 13.5 | 11.6 | 1.9 | 0.2 | 19.8 | 47.3 | 0.5 |
| LongChat-7B-16k | 23.7 | 26.6 | 19.1 | 14.0 | 20.8 | 57.0 | 4.8 |
| XGen-7B-8k | 14.5 | 14.2 | 9.1 | 1.5 | 20.0 | 38.0 | 4.2 |
| InternLM-7B-8k | 18.6 | 33.3 | 8.9 | 13.0 | 15.5 | 39.7 | 0.9 |
| ChatGLM2-6B | 22.5 | 33.0 | 15.2 | 14.6 | 20.5 | 46.1 | 5.5 |
| ChatGLM2-6B-32k | 41.3 | 52.0 | 34.3 | 16.3 | 29.9 | 52.7 | 62.5 |
#### Radar Chart of Long Text Task Capability
![](misc/radar.png)
#### Variation of Abilities under Different Text Lengths
To analyze the models' relative performance at different text lengths more specifically, the following chart shows the average relative score over all tasks within each text length interval.
![](misc/curve.png)
> Note: Suppose the model scores x on the data within a specific length range of a task, and y on all data of that task; then the model's **relative score** for that length range is (x/y-1). To better compare the trends of different models, we shift all curves so that they start at 0 on the 0-4k interval.
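As a worked example of the definition above, the sketch below applies (x/y - 1) to hypothetical per-interval scores for a single model and task, then shifts the curve so that the 0-4k interval sits at 0, as in the figure.

```python
# Hypothetical per-interval scores x for one task (illustrative numbers only).
interval_scores = {"0-4k": 42.0, "4-8k": 38.5, "8k+": 31.0}
overall_score = 37.0  # y: the model's score on all data of that task

# Relative score per interval: x / y - 1.
relative = {k: x / overall_score - 1 for k, x in interval_scores.items()}

# Shift the curve so that it starts at 0 on the 0-4k interval.
offset = relative["0-4k"]
shifted = {k: round(r - offset, 3) for k, r in relative.items()}
print(shifted)  # {'0-4k': 0.0, '4-8k': -0.095, '8k+': -0.297}
```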
## How to evaluate models on LongBench
#### Loading Data
You can download and load the **LongBench** data through the Hugging Face datasets ([🤗 HF Repo](https://huggingface.co/datasets/THUDM/LongBench)):
```python
from datasets import load_dataset
datasets = ["hotpotqa", "2wikimqa", "musique", "dureader", "narrativeqa", "qasper", "multifieldqa_en", \
"multifieldqa_zh", "gov_report", "qmsum", "vcsum", "trec", "nq", "triviaqa", "lsht", "passage_count", \
"passage_retrieval_en", "passage_retrieval_zh", "lcc", "repobench-p"]
for dataset in datasets:
    data = load_dataset('THUDM/LongBench', dataset, split='test')
```
Alternatively, you can download all the evaluation data directly from [this link](https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip).
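If you use the direct download instead, here is a minimal sketch of fetching and inspecting the archive; the exact file layout inside `data.zip` (assumed here to be one JSON-lines file per dataset) is an assumption, so check `namelist()` first.

```python
import io
import json
import zipfile

import requests  # third-party: pip install requests

URL = "https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip"

resp = requests.get(URL, timeout=120)
resp.raise_for_status()
archive = zipfile.ZipFile(io.BytesIO(resp.content))

# Check the actual layout before assuming file names.
print(archive.namelist()[:5])

# Assuming one <dataset>.jsonl file per dataset (adjust the path to what namelist() shows):
with archive.open("hotpotqa.jsonl") as f:
    sample = json.loads(f.readline())
print(sample["dataset"], sample["length"])
```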
#### Data Format
All data in **LongBench** are standardized to the following format:
```json
{
"input": "The input/command for the task, usually short, such as questions in QA, queries in Few-shot tasks, etc.",
"context": "The long context text required for the task, such as documents, cross-file code, few-shot samples in Few-shot tasks",
"answers": "List composed of all standard answers",
"length": "Total length of the first three items of text (counted in characters for Chinese and words for English)",
"dataset": "The name of the dataset to which this piece of data belongs",
"language": "The language of this piece of data",
"all_classes": "All categories in classification tasks, null for non-classification tasks",
"_id": "Random id for each piece of data"
}
```
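A minimal sketch of inspecting these fields on a single sample, reusing the `load_dataset` call from above:

```python
from datasets import load_dataset

data = load_dataset("THUDM/LongBench", "hotpotqa", split="test")
sample = data[0]

# Fields documented above.
print(sample["dataset"], sample["language"], sample["length"], sample["_id"])
print("input:", sample["input"])
print("context preview:", sample["context"][:200])
print("gold answers:", sample["answers"])
print("all_classes:", sample["all_classes"])  # None for non-classification tasks
```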
#### Evaluation
We provide evaluation code using ChatGLM2-6B as an example. First, run [pred.py](pred.py) in the repository:
```bash
CUDA_VISIBLE_DEVICES=0 python pred.py
```
This produces the model's outputs for all datasets under the `pred/` folder. Then run the evaluation code in [eval.py](eval.py):
```bash
python eval.py
```
You can then find the evaluation results for each dataset in `result.json`. Note that in `config/` we provide the input prompt format and the maximum output length that we found suitable for each dataset; you can modify them during evaluation to better fit the model being tested. After modification, [pred.py](pred.py) will automatically organize the data according to the new format and produce the corresponding model outputs.
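For instance, here is a small sketch of adjusting the per-dataset maximum output length in `config/dataset2maxlen.json` before re-running [pred.py](pred.py); the 1024-token value is only an illustrative choice, not a recommendation.

```python
import json

# Load the provided per-dataset output-length limits.
with open("config/dataset2maxlen.json") as f:
    dataset2maxlen = json.load(f)

# Illustrative change: allow longer generated summaries for gov_report.
dataset2maxlen["gov_report"] = 1024

with open("config/dataset2maxlen.json", "w") as f:
    json.dump(dataset2maxlen, f, indent=4)
# pred.py reads this file at startup, so the new limit takes effect on the next run.
```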
## Evaluation Result on Each Dataset
The following tables show the zero-shot evaluation results (%) on all subtask datasets, where Chinese datasets are denoted by "zh" (please refer to this [link](task_en.md) for the evaluation metrics used for each task).
> Note: Following the observation in [Lost in the Middle](https://arxiv.org/abs/2307.03172), when an input exceeds the model's context length we truncate it from the middle, keeping the head and tail of the text. Our experiments show that this truncation strategy has the least impact on model performance.
#### Single-Document QA
| | NarrativeQA | Qasper | MultiFieldQA-en | MultiFieldQA-zh |
| ----------------- | :---------: | :----: | :-------------: | :-------------: |
| GPT-3.5-Turbo-16k | 23.6 | 43.3 | 52.3 | 61.2 |
| Llama2-7B-chat-4k | 19.1 | 19.6 | 35.8 | 11.6 |
| LongChat-7B-16k | 21.6 | 21.6 | 44.6 | 26.6 |
| XGen-7B-8k | 17.9 | 18.3 | 37.2 | 14.2 |
| InternLM-7B-8k | 12.4 | 16.8 | 22.3 | 33.3 |
| ChatGLM2-6B | 11.2 | 23.7 | 34.2 | 33.0 |
| ChatGLM2-6B-32k | 20.4 | 32.2 | 45.7 | 52.0 |
#### Multi-Document QA
| | HotpotQA | 2WikiMQA | Musique | DuReader (zh) |
| ----------------- | :------: | :------: | :-----: | :-----------: |
| GPT-3.5-Turbo-16k | 51.6 | 37.7 | 26.9 | 28.7 |
| Llama2-7B-chat-4k | 24.3 | 31.4 | 8.6 | 1.9 |
| LongChat-7B-16k | 22.4 | 16.8 | 9.1 | 19.1 |
| XGen-7B-8k | 28.3 | 21.5 | 11.5 | 9.1 |
| InternLM-7B-8k | 27.9 | 24.0 | 10.3 | 8.9 |
| ChatGLM2-6B | 20.2 | 19.6 | 5.3 | 15.2 |
| ChatGLM2-6B-32k | 44.9 | 34.9 | 22.2 | 34.3 |
#### Summarization
| | GovReport | QMSum | VCSUM (zh) |
| :---------------- | :-------: | :---: | :--------: |
| GPT-3.5-Turbo-16k | 29.5 | 23.4 | 16.0 |
| Llama2-7B-chat-4k | 27.3 | 20.6 | 0.2 |
| LongChat-7B-16k | 28.4 | 23.2 | 14.0 |
| XGen-7B-8k | 27.8 | 21.7 | 1.5 |
| InternLM-7B-8k | 9.8 | 16.8 | 13.0 |
| ChatGLM2-6B | 23.7 | 22.2 | 14.6 |
| ChatGLM2-6B-32k | 33.3 | 23.9 | 16.3 |
#### Few-shot Learning
| | TREC | NQ | TriviaQA | LSHT (zh) |
| ----------------- | :--: | :--: | :------: | :-------: |
| GPT-3.5-Turbo-16k | 68.0 | 73.0 | 87.1 | 29.2 |
| Llama2-7B-chat-4k | 60.5 | 31.4 | 59.7 | 19.8 |
| LongChat-7B-16k | 61.5 | 44.8 | 73.5 | 20.8 |
| XGen-7B-8k | 66.0 | 43.2 | 67.0 | 20.0 |
| InternLM-7B-8k | 49.0 | 47.6 | 61.6 | 15.5 |
| ChatGLM2-6B | 44.0 | 34.5 | 59.8 | 20.5 |
| ChatGLM2-6B-32k | 62.0 | 64.9 | 77.6 | 29.9 |
#### Code Completion
| | LCC | RepoBench-P |
| ----------------- | :--: | :---------: |
| GPT-3.5-Turbo-16k | 54.7 | 54.3 |
| Llama2-7B-chat-4k | 52.3 | 42.4 |
| LongChat-7B-16k | 59.2 | 54.7 |
| XGen-7B-8k | 38.8 | 37.3 |
| InternLM-7B-8k | 45.5 | 34.0 |
| ChatGLM2-6B | 48.4 | 43.7 |
| ChatGLM2-6B-32k | 55.4 | 50.0 |
#### Synthetic Tasks
| | PassageRetrieval-en | Passage Count | PassageRetrieval-zh |
| ----------------- | :-----------------: | :-----------: | :-----------------: |
| GPT-3.5-Turbo-16k | 71.0 | 4.5 | 77.5 |
| Llama2-7B-chat-4k | 9.2 | 2.5 | 0.5 |
| LongChat-7B-16k | 24.0 | 4.5 | 4.8 |
| XGen-7B-8k | 9.0 | 2.2 | 4.2 |
| InternLM-7B-8k | 6.5 | 2.9 | 0.9 |
| ChatGLM2-6B | 3.2 | 2.1 | 5.5 |
| ChatGLM2-6B-32k | 77.5 | 2.0 | 62.5 |
## Acknowledgements
- Some of the tasks of **LongBench** are based on the datasets proposed by previous researchers, including [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), [DuReader](https://github.com/baidu/DuReader), [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), [QMSum](https://arxiv.org/pdf/2104.05938.pdf), [VCSUM](https://arxiv.org/abs/2305.05280), [TriviaQA](https://nlp.cs.washington.edu/triviaqa/), [NQ](https://ai.google.com/research/NaturalQuestions/), [TREC](https://aclanthology.org/C02-1150.pdf), [LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf), [LCC](https://arxiv.org/abs/2306.14893) and [RepoBench-P](https://arxiv.org/abs/2306.03091).
## Citation
This work is jointly completed by **THUKEG** and **Zhipu AI**. The related paper is still in preparation, and the citation information will be updated when it is ready. Please stay tuned.
If you use this benchmark, you can also cite the papers corresponding to the datasets that LongBench is based on. The relevant citation information is listed [here](refs/ref.bib).
config/dataset2maxlen.json 0 → 100644
{
"passage_count": 32,
"passage_retrieval": 32,
"trec": 64,
"nq": 32,
"triviaqa": 32,
"hotpotqa": 32,
"musique": 32,
"2wikimqa": 32,
"narrativeqa": 128,
"qasper": 128,
"gov_report": 512,
"qmsum": 512,
"passkey_retrieval": 32,
"lsht": 64,
"dureader": 128,
"vcsum": 512,
"multifieldqa_en": 64,
"multifieldqa_zh": 64,
"lcc": 64,
"repobench-p": 64
}
config/dataset2prompt.json 0 → 100644
{
"passage_count": "There are some paragraphs below sourced from Wikipedia. Some of them may be duplicates. Please carefully read these paragraphs and determine how many unique paragraphs there are after removing duplicates. In other words, how many non-repeating paragraphs are there in total?\n\n{context}\n\nPlease enter the final count of unique paragraphs after removing duplicates. The output format should only contain the number, such as 1, 2, 3, and so on.\n\nThe final answer is: ",
"passage_retrieval": "Here are 30 paragraphs from Wikipedia, along with an abstract. Please determine which paragraph the abstract is from.\n\n{context}\n\nThe following is an abstract.\n\n{input}\n\nPlease enter the number of the paragraph that the abstract is from. The answer format must be like \"Paragraph 1\", \"Paragraph 2\", etc.\n\nThe answer is: ",
"trec": "Please determine the type of the question below. Here are some examples of questions.\n\n{context}\n{input}",
"nq": "Answer the question based on the given passage. Only give me the answer and do not output any other words. The following are some examples.\n\n{context}\n\n{input}",
"triviaqa": "Answer the question based on the given passage. Only give me the answer and do not output any other words. The following are some examples.\n\n{context}\n\n{input}",
"hotpotqa": "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{context}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"musique": "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{context}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"2wikimqa": "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{context}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"narrativeqa": "You are given a story, which can be either a novel or a movie script, and a question. Answer the question asconcisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nStory: {context}\n\nNow, answer the question based on the story asconcisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:",
"qasper": "You are given a scientific article and a question. Answer the question as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write \"unanswerable\". If the question is a yes/no question, answer \"yes\", \"no\", or \"unanswerable\". Do not provide any explanation.\n\nArticle: {context}\n\n Answer the question based on the above article as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write \"unanswerable\". If the question is a yes/no question, answer \"yes\", \"no\", or \"unanswerable\". Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:",
"gov_report": "You are given a report by a government agency. Write a one-page summary of the report.\n\nReport:\n{context}\n\nNow, write a one-page summary of the report.\n\nSummary:",
"qmsum": "You are given a meeting transcript and a query containing a question or instruction. Answer the query in one or more sentences.\n\nTranscript:\n{context}\n\nNow, answer the query based on the above meeting transcript in one or more sentences.\n\nQuery: {input}\nAnswer:",
"passkey_retrieval": "下面有一大段话,中间藏了一段密码,密码的前后有特殊的标识。以密码58926为例,对应的标识如下:\n\n***密码开始***<58926>***密码结束***\n\n下面是这段话,请帮我找出其中的密码。\n\n{context}\n\n现在,请直接给出密码,例如,45331,89765,13299和58926等等。",
"lsht": "请判断给定新闻的类别,下面是一些例子。\n\n{context}\n{input}",
"dureader": "请基于给定的文章回答下述问题。\n\n文章:{context}\n\n请基于上述文章回答下面的问题。\n\n问题:{input}\n回答:",
"vcsum": "下面有一段会议记录,请你阅读后,写一段总结,总结会议的内容。\n会议记录:\n{context}\n\n会议总结:",
"multifieldqa_en": "Read the following text and answer briefly.\n\n{context}\n\nNow, answer the following question based on the above text, only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
"multifieldqa_zh": "阅读以下文字并用中文简短回答:\n\n{context}\n\n现在请基于上面的文章回答下面的问题,只告诉我答案,不要输出任何其他字词。\n\n问题:{input}\n回答:",
"lcc": "Please complete the code given below. \n{context}Next line of code:\n",
"repobench-p": "Please complete the code given below. \n{context}{input}Next line of code:\n"
}
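These templates are plain Python format strings; as a small illustration (the sample text below is made up), pred.py fills them like this:

```python
import json

dataset2prompt = json.load(open("config/dataset2prompt.json"))

# Hypothetical sample in the LongBench format (illustration only).
json_obj = {
    "input": "Who founded the company described in the passages?",
    "context": "Passage 1: ...\n\nPassage 2: ...",
}

prompt = dataset2prompt["hotpotqa"].format(**json_obj)
print(prompt)
```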
eval.py 0 → 100644
import os
import json
from typing import List, Any, Dict, Callable
from tqdm import tqdm
from metrics import (
    qa_f1_score,
    rouge_zh_score,
    qa_f1_zh_score,
    rouge_score,
    classification_score,
    retrieval_score,
    retrieval_zh_score,
    count_score,
    code_sim_score,
)

# Map each dataset to its evaluation metric (see task_en.md for the metric used per task).
dataset2metric = {
    "hotpotqa": qa_f1_score,
    "2wikimqa": qa_f1_score,
    "musique": qa_f1_score,
    "dureader": rouge_zh_score,
    "narrativeqa": qa_f1_score,
    "qasper": qa_f1_score,
    "multifieldqa_en": qa_f1_score,
    "multifieldqa_zh": qa_f1_zh_score,
    "gov_report": rouge_score,
    "qmsum": rouge_score,
    "vcsum": rouge_zh_score,
    "trec": classification_score,
    "nq": qa_f1_score,
    "triviaqa": qa_f1_score,
    "lsht": classification_score,
    "passage_retrieval_en": retrieval_score,
    "passage_count": count_score,
    "passage_retrieval_zh": retrieval_zh_score,
    "lcc": code_sim_score,
    "repobench-p": code_sim_score,
}
def scorer(dataset, predictions, answers, all_classes):
    # For each prediction, take the best score over all reference answers, then average over samples.
    total_score = 0.
    for (prediction, ground_truths) in zip(predictions, answers):
        score = 0.
        for ground_truth in ground_truths:
            score = max(score, dataset2metric[dataset](prediction, ground_truth, all_classes=all_classes))
        total_score += score
    return round(100 * total_score / len(predictions), 2)
if __name__ == '__main__':
    scores = dict()
    all_files = os.listdir("pred/")
    # Score every prediction file produced by pred.py and collect the results.
    for filename in all_files:
        predictions, answers = [], []
        dataset = filename.split('.')[0]
        with open(f"pred/{filename}", "r") as f:
            for line in f:
                data = json.loads(line)
                predictions.append(data["pred"])
                answers.append(data["answers"])
                all_classes = data["all_classes"]
        score = scorer(dataset, predictions, answers, all_classes)
        scores[dataset] = score
    with open("result.json", "w") as f:
        json.dump(scores, f, ensure_ascii=False, indent=4)
metrics.py 0 → 100644
import re
import string
import jieba
from fuzzywuzzy import fuzz
import difflib
from typing import List
from collections import Counter
from rouge import Rouge
def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""
    def remove_articles(text):
        return re.sub(r"\b(a|an|the)\b", " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))
def normalize_zh_answer(s):
    """Lower text and remove punctuation, extra whitespace."""
    def white_space_fix(text):
        return "".join(text.split())

    def remove_punc(text):
        cn_punctuation = "!?。。"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏."
        all_punctuation = set(string.punctuation + cn_punctuation)
        return "".join(ch for ch in text if ch not in all_punctuation)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_punc(lower(s)))
def count_score(prediction, ground_truth, **kwargs):
    # Fraction of the numbers mentioned in the prediction that equal the gold count.
    numbers = re.findall(r"\d+", prediction)
    right_num = 0
    for number in numbers:
        if str(number) == str(ground_truth):
            right_num += 1
    final_score = 0.0 if len(numbers) == 0 else right_num / len(numbers)
    return float(final_score)

def retrieval_score(prediction, ground_truth, **kwargs):
    pattern = r'Paragraph (\d+)'
    matches = re.findall(pattern, ground_truth)
    ground_truth_id = matches[0]
    numbers = re.findall(r"\d+", prediction)
    right_num = 0
    for number in numbers:
        if str(number) == str(ground_truth_id):
            right_num += 1
    final_score = 0.0 if len(numbers) == 0 else right_num / len(numbers)
    return float(final_score)

def retrieval_zh_score(prediction, ground_truth, **kwargs):
    pattern = r'段落(\d+)'
    matches = re.findall(pattern, ground_truth)
    ground_truth_id = matches[0]
    numbers = re.findall(r"\d+", prediction)
    right_num = 0
    for number in numbers:
        if str(number) == str(ground_truth_id):
            right_num += 1
    final_score = 0.0 if len(numbers) == 0 else right_num / len(numbers)
    return float(final_score)

def code_sim_score(prediction, ground_truth, **kwargs):
    # Compare the first non-comment line of the prediction against the reference line.
    all_lines = prediction.lstrip('\n').split('\n')
    prediction = ""
    for line in all_lines:
        if ('`' not in line) and ('#' not in line) and ('//' not in line):
            prediction = line
            break
    return (fuzz.ratio(prediction, ground_truth) / 100)
def classification_score(prediction, ground_truth, **kwargs):
    em_match_list = []
    all_classes = kwargs["all_classes"]
    for class_name in all_classes:
        if class_name in prediction:
            em_match_list.append(class_name)
    # Drop matched labels that are proper substrings of the gold label (iterate over a copy while removing).
    for match_term in em_match_list[:]:
        if match_term in ground_truth and match_term != ground_truth:
            em_match_list.remove(match_term)
    if len(em_match_list) != 0:
        if ground_truth in em_match_list:
            score = (1.0 / len(em_match_list))
        else:
            score = 0.0
    else:
        # Fall back to fuzzy matching against the class list when no label string appears verbatim.
        best_match = None
        highest_similarity = 0
        for class_name in all_classes:
            similarity = difflib.SequenceMatcher(None, class_name, prediction).ratio()
            if similarity > highest_similarity:
                highest_similarity = similarity
                best_match = class_name
        score = float(best_match == ground_truth)
    return score
def rouge_score(prediction, ground_truth, **kwargs):
    rouge = Rouge()
    try:
        scores = rouge.get_scores([prediction], [ground_truth], avg=True)
    except:
        return 0.0
    return scores["rouge-l"]["f"]

def rouge_zh_score(prediction, ground_truth, **kwargs):
    prediction = " ".join(list(jieba.cut(prediction, cut_all=False)))
    ground_truth = " ".join(list(jieba.cut(ground_truth, cut_all=False)))
    score = rouge_score(prediction, ground_truth)
    return score
def f1_score(prediction, ground_truth, **kwargs):
    common = Counter(prediction) & Counter(ground_truth)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction)
    recall = 1.0 * num_same / len(ground_truth)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

def qa_f1_score(prediction, ground_truth, **kwargs):
    normalized_prediction = normalize_answer(prediction)
    normalized_ground_truth = normalize_answer(ground_truth)
    prediction_tokens = normalized_prediction.split()
    ground_truth_tokens = normalized_ground_truth.split()
    return f1_score(prediction_tokens, ground_truth_tokens)

def qa_f1_zh_score(prediction, ground_truth, **kwargs):
    prediction_tokens = list(jieba.cut(prediction, cut_all=False))
    ground_truth_tokens = list(jieba.cut(ground_truth, cut_all=False))
    prediction_tokens = [normalize_zh_answer(token) for token in prediction_tokens]
    ground_truth_tokens = [normalize_zh_answer(token) for token in ground_truth_tokens]
    prediction_tokens = [token for token in prediction_tokens if len(token) > 0]
    ground_truth_tokens = [token for token in ground_truth_tokens if len(token) > 0]
    return f1_score(prediction_tokens, ground_truth_tokens)
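A brief usage sketch of the metric functions above, on toy strings (illustrative only; eval.py scales the [0, 1] scores to percentages):

```python
from metrics import qa_f1_score, count_score, retrieval_score, classification_score

print(qa_f1_score("Paris is the capital", "Paris"))                # token-level F1 after normalization
print(count_score("I think there are 7 unique paragraphs", "7"))   # 1.0
print(retrieval_score("The answer is Paragraph 3", "Paragraph 3")) # 1.0
print(classification_score("This looks like a location question", "location",
                           all_classes=["location", "person", "number"]))  # 1.0
```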
misc/curve.png 487 KiB
misc/logo.gif 287 KiB
misc/radar.png 830 KiB
pred.py 0 → 100644
import os
from datasets import load_dataset
import torch
import json
from transformers import AutoTokenizer, AutoModel
from tqdm import tqdm
# This is the customized building prompt for chat models, here is an example for ChatGLM2
def build_chat(tokenizer, prompt):
    return tokenizer.build_prompt(prompt)
def get_pred(model, tokenizer, data, max_length, max_gen, prompt_format, dataset, device):
    preds = []
    for json_obj in tqdm(data):
        prompt = prompt_format.format(**json_obj)
        # truncate to fit max_length (we suggest truncate in the middle, since the left and right side may contain crucial instructions)
        tokenized_prompt = tokenizer(prompt, truncation=False, return_tensors="pt").input_ids[0]
        if len(tokenized_prompt) > max_length:
            half = int(max_length/2)
            prompt = tokenizer.decode(tokenized_prompt[:half], skip_special_tokens=True)+tokenizer.decode(tokenized_prompt[-half:], skip_special_tokens=True)
        if dataset not in ["lcc", "repobench-p", "trec", "nq", "triviaqa", "lsht"]: # chat models are better off without build prompt on these tasks
            prompt = build_chat(tokenizer, prompt)
        input = tokenizer(prompt, truncation=False, return_tensors="pt").to(device)
        context_length = input.input_ids.shape[-1]
        output = model.generate(
            **input,
            max_new_tokens=max_gen,
            num_beams=1,
            do_sample=False,
            temperature=1.0,
        )[0]
        pred = tokenizer.decode(output[context_length:], skip_special_tokens=True)
        preds.append({"pred": pred, "answers": json_obj["answers"], "all_classes": json_obj["all_classes"]})
    return preds
if __name__ == '__main__':
    datasets = ["hotpotqa", "2wikimqa", "musique", "dureader", "narrativeqa", "qasper", "multifieldqa_en", "multifieldqa_zh", "gov_report", \
                "qmsum", "vcsum", "trec", "nq", "triviaqa", "lsht", "passage_count", "passage_retrieval_en", "passage_retrieval_zh", "lcc", "repobench-p"]
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # define your model (ChatGLM2-6B, for instance)
    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
    model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)
    model = model.eval()
    # define max_length
    max_length = 31500
    # we design specific prompt format and max generation length for each task, feel free to modify them to optimize model output
    dataset2prompt = json.load(open("config/dataset2prompt.json", "r"))
    dataset2maxlen = json.load(open("config/dataset2maxlen.json", "r"))
    # predict on each dataset
    if not os.path.exists("pred"):
        os.makedirs("pred")
    for dataset in datasets:
        data = load_dataset('THUDM/LongBench', dataset, split='test')
        prompt_format = dataset2prompt[dataset]
        max_gen = dataset2maxlen[dataset]
        preds = get_pred(model, tokenizer, data, max_length, max_gen, prompt_format, dataset, device)
        with open(f"pred/{dataset}.jsonl", "w") as f:
            for pred in preds:
                json.dump(pred, f)
                f.write('\n')
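`build_chat` above relies on ChatGLM2's `tokenizer.build_prompt`. For a chat model without such a helper, one hypothetical adaptation is a plain string template; replace the wrapper with whatever format your model expects.

```python
# Hypothetical variant of build_chat for a model without tokenizer.build_prompt.
# The [INST] ... [/INST] wrapper is only an example template, not a recommendation.
def build_chat_generic(tokenizer, prompt):
    return f"[INST] {prompt} [/INST]"
```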
refs/ref.bib 0 → 100644
@inproceedings{yang2018hotpotqa,
title={HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering},
author={Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D},
booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
pages={2369--2380},
year={2018}
}
@inproceedings{ho2020constructing,
title={Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps},
author={Ho, Xanh and Nguyen, Anh-Khoa Duong and Sugawara, Saku and Aizawa, Akiko},
booktitle={Proceedings of the 28th International Conference on Computational Linguistics},
pages={6609--6625},
year={2020}
}
@article{trivedi2022musique,
title={♫ MuSiQue: Multihop Questions via Single-hop Question Composition},
author={Trivedi, Harsh and Balasubramanian, Niranjan and Khot, Tushar and Sabharwal, Ashish},
journal={Transactions of the Association for Computational Linguistics},
volume={10},
pages={539--554},
year={2022},
publisher={MIT Press}
}
@article{he2018dureader,
title={DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications},
author={He, Wei and Liu, Kai and Liu, Jing and Lyu, Yajuan and Zhao, Shiqi and Xiao, Xinyan and Liu, Yuan and Wang, Yizhong and Wu, Hua and She, Qiaoqiao and others},
journal={ACL 2018},
pages={37},
year={2018}
}
@article{kovcisky2018narrativeqa,
title={The NarrativeQA Reading Comprehension Challenge},
author={Ko{\v{c}}isk{\`y}, Tom{\'a}{\v{s}} and Schwarz, Jonathan and Blunsom, Phil and Dyer, Chris and Hermann, Karl Moritz and Melis, G{\'a}bor and Grefenstette, Edward},
journal={Transactions of the Association for Computational Linguistics},
volume={6},
pages={317--328},
year={2018},
publisher={MIT Press}
}
@inproceedings{dasigi2021dataset,
title={A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers},
author={Dasigi, Pradeep and Lo, Kyle and Beltagy, Iz and Cohan, Arman and Smith, Noah A and Gardner, Matt},
booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
pages={4599--4610},
year={2021}
}
@inproceedings{huang2021efficient,
title={Efficient Attentions for Long Document Summarization},
author={Huang, Luyang and Cao, Shuyang and Parulian, Nikolaus and Ji, Heng and Wang, Lu},
booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
pages={1419--1436},
year={2021}
}
@inproceedings{zhong2021qmsum,
title={QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization},
author={Zhong, Ming and Yin, Da and Yu, Tao and Zaidi, Ahmad and Mutuma, Mutethia and Jha, Rahul and Hassan, Ahmed and Celikyilmaz, Asli and Liu, Yang and Qiu, Xipeng and others},
booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
pages={5905--5921},
year={2021}
}
@article{wu2023vcsum,
title={VCSUM: A Versatile Chinese Meeting Summarization Dataset},
author={Wu, Han and Zhan, Mingjie and Tan, Haochen and Hou, Zhaohui and Liang, Ding and Song, Linqi},
journal={arXiv preprint arXiv:2305.05280},
year={2023}
}
@inproceedings{joshi2017triviaqa,
title={TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension},
author={Joshi, Mandar and Choi, Eunsol and Weld, Daniel S and Zettlemoyer, Luke},
booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages={1601--1611},
year={2017}
}
@article{kwiatkowski2019natural,
title={Natural questions: a benchmark for question answering research},
author={Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and others},
journal={Transactions of the Association for Computational Linguistics},
volume={7},
pages={453--466},
year={2019},
publisher={MIT Press}
}
@inproceedings{li2002learning,
title={Learning question classifiers},
author={Li, Xin and Roth, Dan},
booktitle={COLING 2002: The 19th International Conference on Computational Linguistics},
year={2002}
}
@article{guo2023longcoder,
title={LongCoder: A Long-Range Pre-trained Language Model for Code Completion},
author={Guo, Daya and Xu, Canwen and Duan, Nan and Yin, Jian and McAuley, Julian},
journal={arXiv preprint arXiv:2306.14893},
year={2023}
}
@article{liu2023repobench,
title={RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems},
author={Liu, Tianyang and Xu, Canwen and McAuley, Julian},
journal={arXiv preprint arXiv:2306.03091},
year={2023}
}
task.md 0 → 100644
# 任务统计
| 任务 | 任务类型 | 评价指标 | 平均长度 |语言 | Sample数量|
| :--------- | :-----------:| :-----------: |:---------: | :-------------: |:---------: |
| HotpotQA | 多文档QA | F1 |9,149 |英文 |200 |
| 2WikiMultihopQA| 多文档QA | F1 |4,885 |英文 |200 |
| Musique| 多文档QA | F1 |7,798 |英文 |200 |
| DuReader| 多文档QA | Rouge-L |15,768 |中文 |200 |
| MultiFieldQA-en| 单文档QA | F1 |4,559 |英文 |150 |
| MultiFieldQA-zh| 单文档QA | F1 |6,771 |中文 |200 |
| NarrativeQA| 单文档QA | F1 |18,405 |英文 |200 |
| Qasper| 单文档QA | F1 |3,619 |英文 |200 |
| GovReport| 摘要 | Rouge-L |8,169 |英文 |200 |
| QMSum| 摘要 | Rouge-L |10,546 |英文 |200 |
| VCSUM| 摘要 | Rouge-L |15,147 |中文 |200 |
| TriviaQA| Few shot | F1 |8,015 |英文 |200 |
| NQ| Few shot | F1 |8,210 |英文 |200 |
| TREC| Few shot | Accuracy |5,176 |英文 |200 |
| LSHT| Few shot | Accuracy |22,333 |中文 |200 |
| PassageRetrieval-en| 合成任务 | Accuracy |9,288 |英文 |200 |
| PassageCount| 合成任务 | Accuracy |11,141 |英文 |200 |
| PassageRetrieval-zh | 合成任务 | Accuracy |6,745 |中文 |200 |
| LCC| 代码 | Edit Sim |1,235 |Python/C#/Java |500 |
| RepoBench-P| 代码 | Edit Sim |5,622 |Python/Java |500 |
> 注:为了避免不同Tokenizer统计的差距,我们使用单词数(Python的split函数)来统计英文数据集和代码数据集的平均长度,使用汉字数来统计中文数据集的平均长度。
# 任务说明
| 任务 | 任务说明 |
| :----------------- | :----------------------------------------------------------- |
| HotpotQA | 基于多篇给定的文档,回答相关问题 |
| 2WikiMultihopQA | 基于多篇给定的文档,回答相关问题 |
| Musique | 基于多篇给定的文档,回答相关问题 |
| DuReader | 基于多篇给定的检索文档,回答相关的中文问题 |
| MultiFieldQA-en | 基于单篇文档,回答英文问题,文档所属的领域相对多元 |
| MultiFieldQA-zh | 基于单篇文档,回答中文问题,文档所属的领域相对多元 |
| NarrativeQA | 基于故事或剧本提问,包括对人物、情节、主题等重要元素的理解 |
| Qasper | 基于单篇论文的提问,问题由NLP读者提出,并由NLP从业者回答 |
| GovReport | 摘要任务,要求对政府的工作报告进行总结摘要 |
| QMSum | 摘要任务,要求基于用户的查询对会议记录进行摘要 |
| VCSUM | 摘要任务,要求对中文会议记录进行总结摘要 |
| TriviaQA | 单文档问答任务,提供若干的Few Shot样例 |
| NQ | 单文档问答任务,提供若干的Few Shot样例 |
| TREC | 分类任务,要求对问题进行分类,一共包含50个类别 |
| LSHT | 中文分类任务,要求对新闻进行分类,一共包含24个类别 |
| PassageRetrieval-en | 给定30个英文维基的段落,判断给定的摘要属于哪个段落 |
| PassageCount | 判断给定的若干的段落中不重复的段落一共有几个 |
| PassageRetrieval-zh | 给定若干个出自C4数据集的中文段落,判断给定的摘要属于哪个段落 |
| LCC | 给定一段较长代码,要求预测出下一行代码 |
| RepoBench-P | 给定一个github仓库内多个文件中的代码(包含文件间依赖),要求预测出下一行代码 |
# 数据构造方式
> 注:对于所有基于已有数据集构造的任务,我们均选用原有数据集的验证集或测试集的数据(VCSUM任务除外)
- [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573)[DuReader](https://github.com/baidu/DuReader)任务基于原始的数据集构建,并进行相关处理使其适用于长文本评测。具体地,对于验证集中的问题,我们会选取包含答案的evidence passage和若干干扰的文章,这些文章和原始的问题共同组成了相关任务的输入。
- MultiFieldQA-zh和MultiFieldQA-en任务由约10种来源的长文本数据组成,包含Latex论文、裁判文书、政府工作报告和谷歌索引的PDF文档等。对于每篇长文本,我们邀请了若干博士生和硕士生来进行标注,即基于长文本提问,并给出正确的答案。为了更好地进行自动化评测,我们要求标注员尽可能提出有确定性答案的问题。
- [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf)[QMSum](https://arxiv.org/pdf/2104.05938.pdf)任务直接使用原论文提供的数据。在具体的构建中,我们使用[ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/)提供的模板来将对应的数据转换为纯文本的输入。
- [VCSUM](https://arxiv.org/abs/2305.05280)任务基于原始的数据集构建,我们针对该数据设计了相应的模板将对应的数据转换为纯文本的输入。
- [TriviaQA](https://nlp.cs.washington.edu/triviaqa/)[NQ](https://ai.google.com/research/NaturalQuestions/)任务参考[CoLT5](https://arxiv.org/abs/2303.09752)的方式进行构建,即会提供若干基于文档进行问答的样例,并要求语言模型基于新的文档回答相关问题。
- [TREC](https://aclanthology.org/C02-1150.pdf)[LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf)任务基于原始的数据集构建。对于验证集中的每个问题,我们采样训练集中的若干数据组成Few-shot样例。这些样例会和验证集中的问题共同组成该任务的输入。
- PassageRetrieval-en任务基于英文维基进行构造。对于每条数据,我们随机采样30段英文维基的段落,并选取其中一段进行摘要(使用GPT-3.5-Turbo)。该任务要求模型给出摘要应该对应的原始段落。
- PassageCount任务基于英文维基进行构造。对于每条数据,我们随机采样若干英文维基的段落,并将其中的每个段落随机重复若干次,最后将段落随机打乱。该任务要求模型判断给定的若干的段落中不重复的段落一共有几个。
- PassageRetrieval-zh任务基于[C4](https://arxiv.org/abs/1910.10683)进行构造。对于每条数据,我们随机采样若干段来自于C4的中文段落,并选取其中一段进行摘要(使用GPT-3.5-Turbo)。该任务要求模型给出摘要应该对应的原始段落。
- [LCC](https://arxiv.org/abs/2306.14893)任务我们基于原始的代码补全数据集采样构建。[RepoBench-P](https://arxiv.org/abs/2306.03091)任务中我们选取了原数据集最具挑战性的XF-F(Cross-File-First)设定,并且参考原文中的Oracle-Filled场景,对于每一条原始数据我们随机抽取包括有效跨文件代码片段(gold snippet)在内的多个跨文件代码片段,将其拼接后作为输入,要求模型从其中利用有效的跨文件代码以补全当前文件中的代码。
task_en.md 0 → 100644
# Task statistics
| Task | Task Type | Eval metric | Avg len |Language | \#Sample |
| --------- | -------------| ------------- |--------- | ------------- |--------- |
| HotpotQA | Multi-doc QA | F1 |9149 |EN |200 |
| 2WikiMultihopQA| Multi-doc QA | F1 |4885 |EN |200 |
| Musique| Multi-doc QA | F1 |7798 |EN |200 |
| DuReader| Multi-doc QA | Rouge-L |15768 |ZH |200 |
| MultiFieldQA-en| Single-doc QA | F1 |4559 |EN |150 |
| MultiFieldQA-zh| Single-doc QA | F1 |6771 |ZH |200 |
| NarrativeQA| Single-doc QA | F1 |18405 |EN |200 |
| Qasper| Single-doc QA | F1 |3619 |EN |200 |
| GovReport| Summarization | Rouge-L |8169 |EN |200 |
| QMSum| Summarization | Rouge-L |10546 |EN |200 |
| VCSUM| Summarization | Rouge-L |15147 |ZH |200 |
| TriviaQA| Few shot | F1 |8015 |EN |200 |
| NQ| Few shot | F1 |8210 |EN |200 |
| TREC| Few shot | Accuracy |5176 |EN |200 |
| LSHT| Few shot | Accuracy |22333 |ZH |200 |
| PassageRetrieval-en| Synthetic | Accuracy |9288 |EN |200 |
| PassageCount| Synthetic | Accuracy |11141 |EN |200 |
| PassageRetrieval-zh | Synthetic | Accuracy |6745 |ZH |200 |
| LCC| Code | Edit Sim |1235 |Python/C#/Java |500 |
| RepoBench-P| Code | Edit Sim |5622 |Python/Java |500 |
> Note: In order to avoid discrepancies caused by different tokenizers, we use the word count (using Python's split function) to calculate the average length of English datasets and code datasets, and use the character count to calculate the average length of Chinese datasets.
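A minimal sketch of the length statistic described in the note above (word count via Python's `split` for English and code, character count for Chinese); the `language` values are assumed to follow the "en"/"zh" convention used elsewhere in LongBench.

```python
def sample_length(text, language):
    """Word count for English/code samples, character count for Chinese samples."""
    return len(text) if language == "zh" else len(text.split())

print(sample_length("def add(a, b):\n    return a + b", "en"))  # counts whitespace-separated words
print(sample_length("今天的会议纪要需要总结", "zh"))  # counts characters
```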
# Task description
| Task | Task Description |
| ----------------- | ------------------------------------------------------------ |
| HotpotQA | Answer related questions based on multiple given documents |
| 2WikiMultihopQA | Answer related questions based on multiple given documents |
| Musique | Answer related questions based on multiple given documents |
| DuReader | Answer related Chinese questions based on multiple retrieved documents |
| MultiFieldQA-en | Answer English questions based on a single document; the documents come from relatively diverse fields |
| MultiFieldQA-zh | Answer Chinese questions based on a single document; the documents come from relatively diverse fields |
| NarrativeQA | Ask questions based on stories or scripts, including understanding of important elements such as characters, plots, themes, etc. |
| Qasper | Answer questions based on a single research paper; questions are posed by NLP readers and answered by NLP practitioners |
| GovReport | A summarization task that requires summarizing government work reports |
| QMSum | A summarization task that requires summarizing meeting records based on user queries |
| VCSUM | A summarization task that requires summarizing Chinese meeting records |
| TriviaQA | Single-document QA task, with several few-shot examples provided |
| NQ | Single-document QA task, with several few-shot examples provided |
| TREC | A classification task that requires categorizing questions, includes 50 categories in total |
| LSHT | A Chinese classification task that requires categorizing news, includes 24 categories in total |
| PassageRetrieval-en | Given 30 English Wikipedia paragraphs, determine which paragraph the given summary belongs to |
| PassageCount | Determine the number of unique paragraphs among a given set of paragraphs |
| PassageRetrieval-zh | Given several Chinese paragraphs from the C4 dataset, determine which paragraph the given abstract belongs to |
| LCC | Given a longer piece of code, predict the next line of code |
| RepoBench-P | Given code in multiple files within a GitHub repository (including inter-file dependencies), predict the next line of code |
# Task construction
> Note: For all tasks constructed from existing datasets, we use data from the validation or test set of the existing dataset (except for VCSUM).
- The tasks of [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) are built based on the original datasets and processed to make them suitable for long text evaluation. Specifically, for questions in the validation set, we select the evidence passage that contains the answer and several distracting articles. These articles together with the original question constitute the input of the related tasks.
- The MultiFieldQA-zh and MultiFieldQA-en tasks consist of long-text data from about 10 sources, including LaTeX papers, judicial documents, government work reports, and PDF documents indexed by Google. For each long text, we invited several PhD and master's students to annotate it, i.e., to ask questions based on the long text and give the correct answers. To better automate evaluation, we asked the annotators to pose questions with definitive answers whenever possible.
- The tasks of [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), and [QMSum](https://arxiv.org/pdf/2104.05938.pdf) directly use the data provided by the original papers. In the specific construction, we use the template provided by [ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/) to convert the corresponding data into pure text input.
- The [VCSUM](https://arxiv.org/abs/2305.05280) task is built based on the original dataset, and we have designed a corresponding template to convert the corresponding data into pure text input.
- The tasks of [TriviaQA](https://nlp.cs.washington.edu/triviaqa/) and [NQ](https://ai.google.com/research/NaturalQuestions/) are constructed in the manner of [CoLT5](https://arxiv.org/abs/2303.09752), which provides several examples of question and answering based on documents, and requires the language model to answer related questions based on new documents.
- The tasks of [TREC](https://aclanthology.org/C02-1150.pdf) and [LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf) are built based on the original datasets. For each question in the validation set, we sample several data from the training set to form few-shot examples. These examples together with the questions in the validation set constitute the input for this task.
- The PassageRetrieval-en task is constructed based on English Wikipedia. For each piece of data, we randomly sample 30 paragraphs from English Wikipedia and select one of them for summarization (using GPT-3.5-Turbo). The task requires the model to specify which original paragraph the summary corresponds to.
- The PassageCount task is constructed based on English Wikipedia. For each piece of data, we randomly sample several paragraphs from English Wikipedia, repeat each paragraph a random number of times, and finally shuffle the paragraphs. This task requires the model to determine how many unique paragraphs there are among the given paragraphs (a small construction sketch follows after this list).
- The PassageRetrieval-zh task is constructed based on [C4](https://arxiv.org/abs/1910.10683). For each piece of data, we randomly sample several Chinese paragraphs from C4 and select one of them for summarization (using GPT-3.5-Turbo). This task requires the model to identify the original paragraph that the summary corresponds to.
- For the [LCC](https://arxiv.org/abs/2306.14893) task, we sample from the original code completion dataset. For the [RepoBench-P](https://arxiv.org/abs/2306.03091) task, we select the most challenging XF-F (Cross-File-First) setting from the original dataset and, following the Oracle-Filled scenario in the original paper, for each original piece of data we randomly extract multiple cross-file code snippets, including the effective cross-file code snippet (the gold snippet), and concatenate them as input; the model is required to use the effective cross-file code to complete the code in the current file.
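To make the PassageCount construction concrete, here is a small illustrative sketch (not the authors' actual pipeline): sample paragraphs, repeat some of them at random, shuffle, and record the number of unique paragraphs as the answer.

```python
import random

def build_passage_count_example(paragraphs, n_unique=8, max_repeats=3, seed=0):
    """Illustrative construction of a PassageCount-style sample (not the official pipeline)."""
    rng = random.Random(seed)
    chosen = rng.sample(paragraphs, n_unique)
    bag = []
    for p in chosen:
        bag.extend([p] * rng.randint(1, max_repeats))  # repeat each paragraph 1-3 times
    rng.shuffle(bag)
    return {"context": "\n\n".join(bag), "answers": [str(n_unique)]}

example = build_passage_count_example([f"Paragraph about topic {i}." for i in range(30)])
print(example["answers"])  # ['8']
```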