From 62fa455dd58c3084c346324bc35f70a057fdacfc Mon Sep 17 00:00:00 2001 From: Lv Xin <lvxin14@163.com> Date: Sun, 30 Jul 2023 11:18:27 +0800 Subject: [PATCH] Change the default language of README to English --- README.md | 292 +++++++++++++++++++++++++++------------------------ README_EN.md | 187 --------------------------------- README_ZH.md | 175 ++++++++++++++++++++++++++++++ task.md | 119 ++++++++++----------- task_en.md | 67 ------------ task_zh.md | 66 ++++++++++++ 6 files changed, 453 insertions(+), 453 deletions(-) delete mode 100644 README_EN.md create mode 100644 README_ZH.md delete mode 100644 task_en.md create mode 100644 task_zh.md diff --git a/README.md b/README.md index d97741a..95519ba 100644 --- a/README.md +++ b/README.md @@ -3,62 +3,65 @@ 🤗 <a href="https://huggingface.co/THUDM/chatglm-6b" target="_blank">HF Repo</a> • 📃 Paper coming soon! </p> -Read this in [English](README_EN.md). - -# LongBench: 多è¯è¨€ã€å¤šä»»åŠ¡é•¿æ–‡æœ¬ç†è§£è¯„测基准 - -**LongBench**是第一个多è¯è¨€ã€å¤šä»»åŠ¡ã€å…¨é¢è¯„估大è¯è¨€æ¨¡åž‹**长文本ç†è§£**能力的综åˆæ•°æ®é›†ã€‚在目å‰å¤§æ¨¡åž‹å¤šè¯è¨€èƒ½åŠ›å¼•èµ·å¹¿æ³›å…³æ³¨çš„背景下,LongBench涵盖了ä¸åŒçš„è¯è¨€ï¼ˆä¸æ–‡å’Œè‹±æ–‡ï¼‰ï¼Œä»¥æ¤æ¥å¯¹å¤§æ¨¡åž‹åœ¨é•¿æ–‡æœ¬ä¸‹çš„多è¯è¨€èƒ½åŠ›è¿›è¡Œæ›´å…¨é¢çš„评估。åŒæ—¶ï¼ŒLongBench由二å个ä¸åŒçš„任务组æˆï¼Œè¦†ç›–了å•æ–‡æ¡£QAã€å¤šæ–‡æ¡£QAã€æ‘˜è¦ã€Few-shotå¦ä¹ ã€ä»£ç 补全和åˆæˆä»»åŠ¡ç‰å…³é”®çš„长文本应用场景。 - -我们深知模型评测过程ä¸å¯èƒ½äº§ç”Ÿçš„高昂æˆæœ¬ï¼Œå°¤å…¶æ˜¯é•¿æ–‡æœ¬åœºæ™¯ä¸‹ï¼ˆå¦‚äººå·¥æ ‡æ³¨æˆæœ¬æˆ–API调用æˆæœ¬ï¼‰ã€‚å› æ¤ï¼Œæˆ‘们采用了一ç§å…¨è‡ªåŠ¨çš„评测方å¼ï¼Œæ—¨åœ¨ä»¥æœ€ä½Žçš„æˆæœ¬ï¼Œæœ€æœ‰æ•ˆåœ°è¡¡é‡å’Œè¯„估模型的长文本ç†è§£èƒ½åŠ›ã€‚ - -LongBench包å«13个英文任务ã€5个ä¸æ–‡ä»»åŠ¡å’Œ2个代ç 任务,多数任务的平å‡é•¿åº¦åœ¨5k-15k之间,共包å«çº¦4500æ¡æµ‹è¯•æ•°æ®ã€‚从主è¦ä»»åŠ¡åˆ†ç±»ä¸Šï¼ŒLongBench包å«å…大类任务,å³å•æ–‡æ¡£QAã€å¤šæ–‡æ¡£QAã€æ‘˜è¦ã€Few-shotå¦ä¹ ã€ä»£ç 补全和åˆæˆä»»åŠ¡ã€‚关于LongBenchæ•°æ®é›†çš„具体统计åŠä»»åŠ¡æž„é€ æ–¹å¼è¯·å‚考[这里](task.md)。 - -| 任务类型 | 英文任务数 | ä¸æ–‡ä»»åŠ¡æ•° | 代ç 任务数 | -| :----------: | :--------: | :--------: | :--------: | -| å•æ–‡æ¡£QA | 3 | 1 | - | -| 多文档QA | 3 | 1 | - | -| æ‘˜è¦ | 2 | 1 | - | -| Few-shotå¦ä¹ | 3 | 1 | - | -| åˆæˆä»»åŠ¡ | 2 | 1 | - | -| 代ç 补全 | - | - | 2 | - -## 排行榜 -我们在这里展示了所有模型在Zero-shot场景下,在ä¸æ–‡å’Œè‹±æ–‡å„大类任务上得分的平å‡å€¼ï¼ˆ%),å„ä»»åŠ¡è¯„ä¼°æ‰€ç”¨æŒ‡æ ‡è¯·å‚考[这里](task.md)。 -#### è‹±æ–‡æ¦œå• -| | Avg | å•æ–‡æ¡£QA | 多文档QA | æ‘˜è¦ | Few-shotå¦ä¹ | 代ç 补全 | åˆæˆä»»åŠ¡ | -| --- | :-: | :-: | :-: | :-: | :-: | :-: | :-: | -| GPT-3.5-Turbo-16k | 45.5 | 39.8 | 38.7 | 26.5 | 76.0 | 54.5 | 37.8 | -| Llama2-7B-chat-4k | 29.0 | 24.8 | 21.4 | 23.9 | 50.5 | 47.3 | 5.9 | -| LongChat-7B-16k | 33.7 | 29.3 | 16.1 | 25.8 | 59.9 | 57.0 | 14.2 | -| XGen-7B-8k | 28.7 | 24.5 | 20.4 | 24.8 | 58.7 | 38.0 | 5.6 | -| InternLM-7B-8k | 24.7 | 17.1 | 20.8 | 13.3 | 52.7 | 39.7 | 4.7 | -| ChatGLM2-6B | 26.0 | 23.1 | 15.0 | 22.9 | 46.1 | 46.1 | 2.7 | -| ChatGLM2-6B-32k | 42.7 | 32.8 | 34.0 | 28.6 | 68.1 | 52.7 | 39.8 | - -#### ä¸æ–‡æ¦œå• -| | Avg | å•æ–‡æ¡£QA | 多文档QA | æ‘˜è¦ | Few-shotå¦ä¹ | 代ç 补全 | åˆæˆä»»åŠ¡ | -|-------|:---:|:-------------:|:------------:|:-------------:|:-----------------:|:---------------:|:----------------:| -| GPT-3.5-Turbo-16k | 44.5 | 61.2 | 28.7 | 16.0 | 29.2 | 54.5 | 77.5 | -| Llama2-7B-chat-4k | 13.5 | 11.6 | 1.9 | 0.2 | 19.8 | 47.3 | 0.5 | -| LongChat-7B-16k | 23.7 | 26.6 | 19.1 | 14.0 | 20.8 | 57.0 | 4.8 | -| XGen-7B-8k | 14.5 | 14.2 | 9.1 | 1.5 | 20.0 | 38.0 | 4.2 | -| InternLM-7B-8k | 18.6 | 33.3 | 8.9 | 13.0 | 15.5 | 39.7 | 0.9 | -| ChatGLM2-6B | 22.5 | 33.0 | 15.2 | 14.6 | 20.5 | 46.1 | 5.5 | -| ChatGLM2-6B-32k | 41.3 | 52.0 | 34.3 | 16.3 | 29.9 | 52.7 | 62.5 | - -#### 
长文本任务能力雷达图 +阅读 [ä¸æ–‡ç‰ˆæœ¬](README_ZH.md) + +# LongBench: A Multilingual, Multitask Benchmark Tailored for Long Context Understanding + +**LongBench** is the first comprehensive dataset for multi-language, multi-task, and comprehensive assessment of **long text understanding** capabilities of large language models. In the context of the widespread attention to the multi-language capabilities of large models, LongBench includes different languages (Chinese and English) to provide a more comprehensive evaluation of the large models' multi-language capabilities in long texts. In addition, LongBench consists of twenty different tasks, covering key long-text application scenarios such as single-document QA, multi-document QA, summaries, few-shot learning, code completion, and synthesis tasks. + +We are fully aware of the potentially high costs involved in the model evaluation process, especially in the context of long-text scenarios (such as manual annotation costs or API call costs). Therefore, we have adopted a fully automated evaluation method, aimed at measuring and evaluating the model's ability to understand long texts at the lowest cost and most effectively. + +LongBench includes 13 English tasks, 5 Chinese tasks, and 2 code tasks, with the average length of most tasks ranging from 5k to 15k. From the main task categories, LongBench includes six types of tasks, namely multi-document QA, single-document QA, summaries, Few-shot learning, synthetic tasks, and code completion. For detailed statistics and construction methods of LongBench tasks, please refer [here](task.md). + +| Task Type | \#English Task | \#Chinese Task | \#Code Task | +| :-------: | :--------------------: | :--------------------: | :------------------: | +| Multi-document QA | 3 | 1 | - | +| Single-document QA | 3 | 1 | - | +| Summarization | 2 | 1 | - | +| Few-shot learning | 3 | 1 | - | +| Synthetic Tasks | 2 | 1 | - | +| Code Completion | - | - | 2 | + +## Leaderboard +Here is the average score (%) of all models on various major tasks in both Chinese and English languages under the Zero-shot scenario. Please refer to this [link](task.md) for the evaluation metrics used for each task. 
+ +#### English +| | Avg | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Tasks | +| ----------------- | :--: | :-----------: | :----------: | :-----------: | :---------------: | :-------------: | :-------------: | +| GPT-3.5-Turbo-16k | 45.5 | 39.8 | 38.7 | 26.5 | 76.0 | 54.5 | 37.8 | +| Llama2-7B-chat-4k | 29.0 | 24.8 | 21.4 | 23.9 | 50.5 | 47.3 | 5.9 | +| LongChat-7B-16k | 33.7 | 29.3 | 16.1 | 25.8 | 59.9 | 57.0 | 14.2 | +| XGen-7B-8k | 28.7 | 24.5 | 20.4 | 24.8 | 58.7 | 38.0 | 5.6 | +| InternLM-7B-8k | 24.7 | 17.1 | 20.8 | 13.3 | 52.7 | 39.7 | 4.7 | +| ChatGLM2-6B | 26.0 | 23.1 | 15.0 | 22.9 | 46.1 | 46.1 | 2.7 | +| ChatGLM2-6B-32k | 42.7 | 32.8 | 34.0 | 28.6 | 68.1 | 52.7 | 39.8 | + +#### Chinese + +| | Avg | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Tasks | +| ----------------- | :--: | :-----------: | :----------: | :-----------: | :---------------: | :-------------: | :-------------: | +| GPT-3.5-Turbo-16k | 44.5 | 61.2 | 28.7 | 16.0 | 29.2 | 54.5 | 77.5 | +| Llama2-7B-chat-4k | 13.5 | 11.6 | 1.9 | 0.2 | 19.8 | 47.3 | 0.5 | +| LongChat-7B-16k | 23.7 | 26.6 | 19.1 | 14.0 | 20.8 | 57.0 | 4.8 | +| XGen-7B-8k | 14.5 | 14.2 | 9.1 | 1.5 | 20.0 | 38.0 | 4.2 | +| InternLM-7B-8k | 18.6 | 33.3 | 8.9 | 13.0 | 15.5 | 39.7 | 0.9 | +| ChatGLM2-6B | 22.5 | 33.0 | 15.2 | 14.6 | 20.5 | 46.1 | 5.5 | +| ChatGLM2-6B-32k | 41.3 | 52.0 | 34.3 | 16.3 | 29.9 | 52.7 | 62.5 | + +#### Radar Chart of Long Text Task Capability +  -#### ä¸åŒé•¿åº¦æ–‡æœ¬ä¸‹çš„能力å˜åŒ– -为了更有针对性地分æžæ¨¡åž‹åœ¨ä¸åŒæ–‡æœ¬é•¿åº¦ä¸‹çš„相对表现,下图展示了模型在ä¸åŒæ–‡æœ¬é•¿åº¦åŒºé—´ä¸Šï¼Œæ‰€æœ‰ä»»åŠ¡ä¸Šçš„å¹³å‡ç›¸å¯¹åˆ†æ•°ã€‚ +#### Variation of Abilities under Different Text Lengths +To more specifically analyze the model's relative performance under different text lengths, the following chart shows the average relative scores on all tasks over different text length intervals.  -> 注:å‡è®¾æ¨¡åž‹åœ¨æŸä¸ªä»»åŠ¡çš„特定长度范围内数æ®ä¸Šå¾—分为x,在该任务所有数æ®ä¸Šå¾—分为y,则模型在该长度范围的**相对分数**为(x/y-1)。为了更好比较ä¸åŒæ¨¡åž‹çš„å˜åŒ–趋势,我们在0-4k将所有折线平移至0。 +> Note: Assume that the model scores x on the data within a specific length range of a task, and y on all data of that task, then the model's **relative score** for that length range is (x/y-1). To better compare the trends of different models, we shift all the lines to 0 from 0-4k. -## 如何在LongBench上评测模型 +## How to evaluate models on LongBench -#### è½½å…¥æ•°æ® -ä½ å¯ä»¥é€šè¿‡Hugging Face datasetsæ¥ä¸‹è½½å¹¶è½½å…¥**LongBench**çš„æ•°æ®ï¼ˆ[🤗 HF Repo](https://huggingface.co/datasets/THUDM/LongBench)): +#### Loading Data +You can download and load the **LongBench** data through the Hugging Face datasets ([🤗 HF Repo](https://huggingface.co/datasets/THUDM/LongBench)): ```python from datasets import load_dataset @@ -69,107 +72,116 @@ datasets = ["hotpotqa", "2wikimqa", "musique", "dureader", "narrativeqa", "qaspe for dataset in datasets: data = load_dataset('THUDM/LongBench', dataset, split='test') ``` -åŒæ ·åœ°ï¼Œä½ 也å¯ä»¥ç›´æŽ¥ç”¨è¿™ä¸ª[链接](https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip)下载所有的评测数æ®ã€‚ +Alternatively, you can download the folder from [this link](https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip) to load the data. 
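As a quick sanity check after loading, you can inspect a single sample. The short sketch below is only illustrative; the subset name comes from the list above and the field names follow the unified format described in the next section:

```python
from datasets import load_dataset

# Load one subset and look at a single sample.
data = load_dataset('THUDM/LongBench', 'hotpotqa', split='test')
sample = data[0]

print(sample["dataset"], sample["language"], sample["length"])
print(sample["input"])           # the short instruction / question
print(sample["context"][:500])   # beginning of the long context
print(sample["answers"])         # list of gold answers
```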
+
+#### Data Format
+
+All data in **LongBench** are standardized to the following format:
 
-#### 数据格式
-**LongBench**中所有数据都统一为以下格式:
 ```json
 {
-    "input": "任务的输入/指令,通常较短,比如QA中的问题、Few-shot任务中的提问等",
-    "context": "任务所需的长语境文本,比如文档、跨文件代码、Few-shot任务中的few-shot样本",
-    "answers": "由所有标准答案组成的列表",
-    "length": "前三项文本的总长度(中、英文分别用字、词数统计)",
-    "dataset": "本条数据所属数据集名称",
-    "language": "本条数据的语言",
-    "all_classes": "分类任务中的所有类别,非分类任务则为null",
-    "_id": "每条数据的随机id"
+    "input": "The input/instruction for the task, usually short, such as the question in QA or the query in few-shot tasks",
+    "context": "The long context text required by the task, such as documents, cross-file code, or the few-shot examples in few-shot tasks",
+    "answers": "A list of all standard answers",
+    "length": "The total length of the first three fields (counted in characters for Chinese and in words for English)",
+    "dataset": "The name of the dataset this sample belongs to",
+    "language": "The language of this sample",
+    "all_classes": "All categories in classification tasks, null for non-classification tasks",
+    "_id": "A random id for each sample"
 }
 ```
 
-#### 评测
-我们以ChatGLM2-6B为例提供了一份评测代码。首先,运行仓库下的[pred.py](pred.py)
+#### Evaluation
+We provide evaluation code that uses ChatGLM2-6B as an example. First, run [pred.py](pred.py) in the repository:
 ```bash
 CUDA_VISIBLE_DEVICES=0 python pred.py
 ```
-可以在`pred/`文件夹下得到模型在所有数据集下的输出,此后运行[eval.py](eval.py)的评测代码:
+This produces the model outputs on all datasets in the `pred/` folder. After that, run the evaluation code in [eval.py](eval.py):
 ```bash
 python eval.py
 ```
-可以在`result.json`中得到在各数据集上的评测结果。请注意,我们在`config/`下提供了我们总结出来的在各数据集上适合的输入格式和最大输出长度限制,在评测的时候可以进行修改以更好地适用你要评测的模型,修改后在[pred.py](pred.py)评测时会自动按照新的格式去整理数据并得到对应的模型输出。
+This writes the evaluation results on all datasets to `result.json`. Note that `config/` contains the input format and the maximum output length limit we recommend for each dataset; you can modify them to better suit the model you want to evaluate, and [pred.py](pred.py) will then automatically organize the data according to the new format and produce the corresponding model output.
+
+## Evaluation Results on Each Dataset
+
+The following tables show the Zero-shot evaluation results (%) of the models on all subtask datasets, where Chinese datasets are denoted by "zh" (please refer to this [link](task.md) for the evaluation metrics used for each task).
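For intuition about these scores, a word-overlap F1 of the kind commonly used for the QA-style subtasks can be sketched as follows. This is only an illustration (the function name is ours), and it is not guaranteed to match the exact scoring, such as answer normalization, implemented in [eval.py](eval.py):

```python
from collections import Counter

def qa_f1(prediction: str, reference: str) -> float:
    """Word-level F1 between a model prediction and a single gold answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# With several gold answers per sample (the "answers" list), the sample score
# is typically taken as the maximum F1 over all gold answers.
```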
-## 详细评测结果 -下é¢çš„å‡ å¼ è¡¨æ ¼å±•ç¤ºäº†æ¨¡åž‹åœ¨æ‰€æœ‰å任务数æ®é›†ä¸Šçš„Zero-shot评测结果(%),其ä¸çš„ä¸æ–‡æ•°æ®é›†ä»¥â€œzhâ€æ ‡ç¤ºï¼ˆå„ä»»åŠ¡è¯„ä¼°æ‰€ç”¨æŒ‡æ ‡è¯·å‚考[这里](task.md))。 -> 注:对于超出模型处ç†é•¿åº¦èƒ½åŠ›çš„文本,å‚考[Lost in the Middle](https://arxiv.org/abs/2307.03172)的观察,我们从文本ä¸é—´è¿›è¡Œæˆªæ–,ä¿æŒå‰åŽéƒ¨åˆ†çš„ä¿¡æ¯ã€‚实验表明,这ç§æˆªæ–æ–¹å¼å¯¹æ¨¡åž‹æ€§èƒ½å½±å“最å°ã€‚ -#### å•æ–‡æ¡£QA +#### Single-Document QA | | NarrativeQA | Qasper | MultiFieldQA-en | MultiFieldQA-zh | -|-------------------|:-----------:|:------:|:---------------:|:---------------:| -| GPT-3.5-Turbo-16k | 23.6 | 43.3 | 52.3 | 61.2 | -| Llama2-7B-chat-4k | 19.1 | 19.6 | 35.8 | 11.6 | -| LongChat-7B-16k | 21.6 | 21.6 | 44.6 | 26.6 | -| XGen-7B-8k | 17.9 | 18.3 | 37.2 | 14.2 | -| InternLM-7B-8k | 12.4 | 16.8 | 22.3 | 33.3 | -| ChatGLM2-6B | 11.2 | 23.7 | 34.2 | 33.0 | -| ChatGLM2-6B-32k | 20.4 | 32.2 | 45.7 | 52.0 | - -#### 多文档QA -| | HotpotQA | 2WikiMQA | Musique | DuReader (zh) | -|----------------------|:--------:|:--------:|:-------:|:--------:| -| GPT-3.5-Turbo-16k | 51.6 | 37.7 | 26.9 | 28.7 | -| Llama2-7B-chat-4k | 24.3 | 31.4 | 8.6 | 1.9 | -| LongChat-7B-16k | 22.4 | 16.8 | 9.1 | 19.1 | -| XGen-7B-8k | 28.3 | 21.5 | 11.5 | 9.1 | -| InternLM-7B-8k | 27.9 | 24.0 | 10.3 | 8.9 | -| ChatGLM2-6B | 20.2 | 19.6 | 5.3 | 15.2 | -| ChatGLM2-6B-32k | 44.9 | 34.9 | 22.2 | 34.3 | - -#### æ‘˜è¦ -| | GovReport | QMSum | VCSUM (zh) | -|:-----------|:---------:|:-----:|:-----:| -| GPT-3.5-Turbo-16k | 29.5 | 23.4 | 16.0 | -| Llama2-7B-chat-4k | 27.3 | 20.6 | 0.2 | -| LongChat-7B-16k | 28.4 | 23.2 | 14.0 | -| XGen-7B-8k | 27.8 | 21.7 | 1.5 | -| InternLM-7B-8k | 9.8 | 16.8 | 13.0 | -| ChatGLM2-6B | 23.7 | 22.2 | 14.6 | -| ChatGLM2-6B-32k | 33.3 | 23.9 | 16.3 | - -#### Few-shotå¦ä¹ -| | TREC | NQ | TriviaQA | LSHT (zh) | -| --- | :-: | :-: | :-: | :-: | -| GPT-3.5-Turbo-16k | 68.0 | 73.0 | 87.1 | 29.2 | -| Llama2-7B-chat-4k | 60.5 | 31.4 | 59.7 | 19.8 | -| LongChat-7B-16k | 61.5 | 44.8 | 73.5 | 20.8 | -| XGen-7B-8k | 66.0 | 43.2 | 67.0 | 20.0 | -| InternLM-7B-8k | 49.0 | 47.6 | 61.6 | 15.5 | -| ChatGLM2-6B | 44.0 | 34.5 | 59.8 | 20.5 | -| ChatGLM2-6B-32k | 62.0 | 64.9 | 77.6 | 29.9 | - -#### 代ç 补全 -| | LCC | RepoBench-P | -| --- | :-: | :-: | -| GPT-3.5-Turbo-16k | 54.7 | 54.3 | -| Llama2-7B-chat-4k | 52.3 | 42.4 | -| LongChat-7B-16k | 59.2 | 54.7 | -| XGen-7B-8k | 38.8 | 37.3 | -| InternLM-7B-8k | 45.5 | 34.0 | -| ChatGLM2-6B | 48.4 | 43.7 | -| ChatGLM2-6B-32k | 55.4 | 50.0 | - -#### åˆæˆä»»åŠ¡ -| | PassageRetrieval-en | Passage Count | PassageRetrieval-zh | -| --- | :-: | :-: | :-: | -| GPT-3.5-Turbo-16k | 71.0 | 4.5 | 77.5 | -| Llama2-7B-chat-4k | 9.2 | 2.5 | 0.5 | -| LongChat-7B-16k | 24.0 | 4.5 | 4.8 | -| XGen-7B-8k | 9.0 | 2.2 | 4.2 | -| InternLM-7B-8k | 6.5 | 2.9 | 0.9 | -| ChatGLM2-6B | 3.2 | 2.1 | 5.5 | -| ChatGLM2-6B-32k | 77.5 | 2.0 | 62.5 | - -## 致谢 -- **LongBench**的部分任务基于之å‰çš„ç 
”ç©¶è€…æ出的数æ®é›†æž„建,包括[HotpotQA](https://hotpotqa.github.io/),[2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/),[Musique](https://arxiv.org/abs/2108.00573),[DuReader](https://github.com/baidu/DuReader),[NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf),[Qasper](https://arxiv.org/pdf/2105.03011.pdf),[GovReport](https://arxiv.org/pdf/2104.02112.pdf),[QMSum](https://arxiv.org/pdf/2104.05938.pdf),[VCSUM](https://arxiv.org/abs/2305.05280),[TriviaQA](https://nlp.cs.washington.edu/triviaqa/),[NQ](https://ai.google.com/research/NaturalQuestions/),[TREC](https://aclanthology.org/C02-1150.pdf),[LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf),[LCC](https://arxiv.org/abs/2306.14893)å’Œ[RepoBench-P](https://arxiv.org/abs/2306.03091)。 - -## 引用 -本工作由**THUKEG**å’Œ**Zhipu AI**å…±åŒå®Œæˆï¼Œç›¸å…³è®ºæ–‡æ£åœ¨æ’°å†™ä¸ï¼Œå±Šæ—¶å°†æ›´æ–°å¼•ç”¨ä¿¡æ¯ï¼Œæ•¬è¯·å…³æ³¨~ - -如果您使用该Benchmark,å¯ä»¥ä¸€å¹¶å¼•ç”¨LongBench所基于的数æ®é›†å¯¹åº”的论文,相关引用信æ¯åˆ—举在[这里](refs/ref.bib)。 +| ----------------- | :---------: | :----: | :-------------: | :-------------: | +| GPT-3.5-Turbo-16k | 23.6 | 43.3 | 52.3 | 61.2 | +| Llama2-7B-chat-4k | 19.1 | 19.6 | 35.8 | 11.6 | +| LongChat-7B-16k | 21.6 | 21.6 | 44.6 | 26.6 | +| XGen-7B-8k | 17.9 | 18.3 | 37.2 | 14.2 | +| InternLM-7B-8k | 12.4 | 16.8 | 22.3 | 33.3 | +| ChatGLM2-6B | 11.2 | 23.7 | 34.2 | 33.0 | +| ChatGLM2-6B-32k | 20.4 | 32.2 | 45.7 | 52.0 | + +#### Multi-Document QA + +| | HotpotQA | 2WikiMQA | Musique | DuReader (zh) | +| ----------------- | :------: | :------: | :-----: | :-----------: | +| GPT-3.5-Turbo-16k | 51.6 | 37.7 | 26.9 | 28.7 | +| Llama2-7B-chat-4k | 24.3 | 31.4 | 8.6 | 1.9 | +| LongChat-7B-16k | 22.4 | 16.8 | 9.1 | 19.1 | +| XGen-7B-8k | 28.3 | 21.5 | 11.5 | 9.1 | +| InternLM-7B-8k | 27.9 | 24.0 | 10.3 | 8.9 | +| ChatGLM2-6B | 20.2 | 19.6 | 5.3 | 15.2 | +| ChatGLM2-6B-32k | 44.9 | 34.9 | 22.2 | 34.3 | + +#### Summarization + +| | GovReport | QMSum | VCSUM (zh) | +| :---------------- | :-------: | :---: | :--------: | +| GPT-3.5-Turbo-16k | 29.5 | 23.4 | 16.0 | +| Llama2-7B-chat-4k | 27.3 | 20.6 | 0.2 | +| LongChat-7B-16k | 28.4 | 23.2 | 14.0 | +| XGen-7B-8k | 27.8 | 21.7 | 1.5 | +| InternLM-7B-8k | 9.8 | 16.8 | 13.0 | +| ChatGLM2-6B | 23.7 | 22.2 | 14.6 | +| ChatGLM2-6B-32k | 33.3 | 23.9 | 16.3 | + +#### Few-shot Learning + +| | TREC | NQ | TriviaQA | LSHT (zh) | +| ----------------- | :--: | :--: | :------: | :-------: | +| GPT-3.5-Turbo-16k | 68.0 | 73.0 | 87.1 | 29.2 | +| Llama2-7B-chat-4k | 60.5 | 31.4 | 59.7 | 19.8 | +| LongChat-7B-16k | 61.5 | 44.8 | 73.5 | 20.8 | +| XGen-7B-8k | 66.0 | 43.2 | 67.0 | 20.0 | +| InternLM-7B-8k | 49.0 | 47.6 | 61.6 | 15.5 | +| ChatGLM2-6B | 44.0 | 34.5 | 59.8 | 20.5 | +| ChatGLM2-6B-32k | 62.0 | 64.9 | 77.6 | 29.9 | + +#### Code Completion + +| | LCC | RepoBench-P | +| ----------------- | :--: | :---------: | +| GPT-3.5-Turbo-16k | 54.7 | 54.3 | +| Llama2-7B-chat-4k | 52.3 | 42.4 | +| LongChat-7B-16k | 59.2 | 54.7 | +| XGen-7B-8k | 38.8 | 37.3 | +| InternLM-7B-8k | 45.5 | 34.0 | +| ChatGLM2-6B | 48.4 | 43.7 | +| ChatGLM2-6B-32k | 55.4 | 50.0 | + +#### Synthetic Tasks + +| | PassageRetrieval-en | Passage Count | PassageRetrieval-zh | +| ----------------- | :-----------------: | :-----------: | :-----------------: | +| GPT-3.5-Turbo-16k | 71.0 | 4.5 | 77.5 | +| Llama2-7B-chat-4k | 9.2 | 2.5 | 0.5 | +| LongChat-7B-16k | 24.0 | 4.5 | 4.8 | +| XGen-7B-8k | 9.0 | 2.2 | 4.2 | +| InternLM-7B-8k | 6.5 | 2.9 | 0.9 | +| ChatGLM2-6B | 3.2 | 2.1 | 5.5 | +| ChatGLM2-6B-32k | 77.5 | 2.0 | 62.5 | + +## 
Acknowledgements + +- Some of the tasks of **LongBench** are based on the datasets proposed by previous researchers, including [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), [DuReader](https://github.com/baidu/DuReader), [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), [QMSum](https://arxiv.org/pdf/2104.05938.pdf), [VCSUM](https://arxiv.org/abs/2305.05280), [TriviaQA](https://nlp.cs.washington.edu/triviaqa/), [NQ](https://ai.google.com/research/NaturalQuestions/), [TREC](https://aclanthology.org/C02-1150.pdf), [LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf), [LCC](https://arxiv.org/abs/2306.14893) and [RepoBench-P](https://arxiv.org/abs/2306.03091). + +## Citation +This work is jointly completed by **THUKEG** and **Zhipu AI**. The related paper is currently being written, and the citation information will be updated when it's ready. Please stay tuned~ + +If you use this benchmark, you can also cite the papers corresponding to the datasets that LongBench is based on. The relevant citation information is listed [here](refs/ref.bib). \ No newline at end of file diff --git a/README_EN.md b/README_EN.md deleted file mode 100644 index 629c756..0000000 --- a/README_EN.md +++ /dev/null @@ -1,187 +0,0 @@ - -<p align="center"> - 🤗 <a href="https://huggingface.co/THUDM/chatglm-6b" target="_blank">HF Repo</a> • 📃 Paper coming soon! -</p> - -阅读 [ä¸æ–‡ç‰ˆæœ¬](README.md) - -# LongBench: A Multilingual, Multitask Benchmark Tailored for Long Context Understanding - -**LongBench** is the first comprehensive dataset for multi-language, multi-task, and comprehensive assessment of **long text understanding** capabilities of large language models. In the context of the widespread attention to the multi-language capabilities of large models, LongBench includes different languages (Chinese and English) to provide a more comprehensive evaluation of the large models' multi-language capabilities in long texts. In addition, LongBench consists of twenty different tasks, covering key long-text application scenarios such as single-document QA, multi-document QA, summaries, few-shot learning, code completion, and synthesis tasks. - -We are fully aware of the potentially high costs involved in the model evaluation process, especially in the context of long-text scenarios (such as manual annotation costs or API call costs). Therefore, we have adopted a fully automated evaluation method, aimed at measuring and evaluating the model's ability to understand long texts at the lowest cost and most effectively. - -LongBench includes 13 English tasks, 5 Chinese tasks, and 2 code tasks, with the average length of most tasks ranging from 5k to 15k. From the main task categories, LongBench includes six types of tasks, namely multi-document QA, single-document QA, summaries, Few-shot learning, synthetic tasks, and code completion. For detailed statistics and construction methods of LongBench tasks, please refer [here](task_en.md). 
- -| Task Type | \#English Task | \#Chinese Task | \#Code Task | -| :-------: | :--------------------: | :--------------------: | :------------------: | -| Multi-document QA | 3 | 1 | - | -| Single-document QA | 3 | 1 | - | -| Summarization | 2 | 1 | - | -| Few-shot learning | 3 | 1 | - | -| Synthetic Tasks | 2 | 1 | - | -| Code Completion | - | - | 2 | - -## Leaderboard -Here is the average score (%) of all models on various major tasks in both Chinese and English languages under the Zero-shot scenario. Please refer to this [link](task_en.md) for the evaluation metrics used for each task. - -#### English -| | Avg | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Tasks | -| ----------------- | :--: | :-----------: | :----------: | :-----------: | :---------------: | :-------------: | :-------------: | -| GPT-3.5-Turbo-16k | 45.5 | 39.8 | 38.7 | 26.5 | 76.0 | 54.5 | 37.8 | -| Llama2-7B-chat-4k | 29.0 | 24.8 | 21.4 | 23.9 | 50.5 | 47.3 | 5.9 | -| LongChat-7B-16k | 33.7 | 29.3 | 16.1 | 25.8 | 59.9 | 57.0 | 14.2 | -| XGen-7B-8k | 28.7 | 24.5 | 20.4 | 24.8 | 58.7 | 38.0 | 5.6 | -| InternLM-7B-8k | 24.7 | 17.1 | 20.8 | 13.3 | 52.7 | 39.7 | 4.7 | -| ChatGLM2-6B | 26.0 | 23.1 | 15.0 | 22.9 | 46.1 | 46.1 | 2.7 | -| ChatGLM2-6B-32k | 42.7 | 32.8 | 34.0 | 28.6 | 68.1 | 52.7 | 39.8 | - -#### Chinese - -| | Avg | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Tasks | -| ----------------- | :--: | :-----------: | :----------: | :-----------: | :---------------: | :-------------: | :-------------: | -| GPT-3.5-Turbo-16k | 44.5 | 61.2 | 28.7 | 16.0 | 29.2 | 54.5 | 77.5 | -| Llama2-7B-chat-4k | 13.5 | 11.6 | 1.9 | 0.2 | 19.8 | 47.3 | 0.5 | -| LongChat-7B-16k | 23.7 | 26.6 | 19.1 | 14.0 | 20.8 | 57.0 | 4.8 | -| XGen-7B-8k | 14.5 | 14.2 | 9.1 | 1.5 | 20.0 | 38.0 | 4.2 | -| InternLM-7B-8k | 18.6 | 33.3 | 8.9 | 13.0 | 15.5 | 39.7 | 0.9 | -| ChatGLM2-6B | 22.5 | 33.0 | 15.2 | 14.6 | 20.5 | 46.1 | 5.5 | -| ChatGLM2-6B-32k | 41.3 | 52.0 | 34.3 | 16.3 | 29.9 | 52.7 | 62.5 | - -#### Radar Chart of Long Text Task Capability - - - -#### Variation of Abilities under Different Text Lengths -To more specifically analyze the model's relative performance under different text lengths, the following chart shows the average relative scores on all tasks over different text length intervals. - - -> Note: Assume that the model scores x on the data within a specific length range of a task, and y on all data of that task, then the model's **relative score** for that length range is (x/y-1). To better compare the trends of different models, we shift all the lines to 0 from 0-4k. - -## How to evaluate models on LongBench - -#### Loading Data -You can download and load the **LongBench** data through the Hugging Face datasets ([🤗 HF Repo](https://huggingface.co/datasets/THUDM/LongBench)): -```python -from datasets import load_dataset - -datasets = ["hotpotqa", "2wikimqa", "musique", "dureader", "narrativeqa", "qasper", "multifieldqa_en", \ - "multifieldqa_zh", "gov_report", "qmsum", "vcsum", "trec", "nq", "triviaqa", "lsht", "passage_count", \ - "passage_retrieval_en", "passage_retrieval_zh", "lcc", "repobench-p"] - -for dataset in datasets: - data = load_dataset('THUDM/LongBench', dataset, split='test') -``` -Alternatively, you can download the folder from [this link](https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip) to load the data. 
- -#### Data Format - -All data in **LongBench** are standardized to the following format: - -```json -{ - "input": "The input/command for the task, usually short, such as questions in QA, queries in Few-shot tasks, etc.", - "context": "The long context text required for the task, such as documents, cross-file code, few-shot samples in Few-shot tasks", - "answers": "List composed of all standard answers", - "length": "Total length of the first three items of text (counted in characters for Chinese and words for English)", - "dataset": "The name of the dataset to which this piece of data belongs", - "language": "The language of this piece of data", - "all_classes": "All categories in classification tasks, null for non-classification tasks", - "_id": "Random id for each piece of data" -} -``` - -#### Evaluation -We provide an evaluation code using ChatGLM2-6B as an example. Firstly, run the [pred.py](pred.py) under the repository: -```bash -CUDA_VISIBLE_DEVICES=0 python pred.py -``` -You can get the model outputs on all datasets in the `pred/` folder. After that, run the evaluation code of [eval.py](eval.py): -```bash -python eval.py -``` -You can get the evaluation results on various datasets in `result.json`. Please note that we provide the input format suitable for each dataset and the maximum output length limit we summarized in `config/`. You can modify them during the evaluation to better suit the model you want to evaluate. After modification, when evaluating with [pred.py](pred.py), the data will be automatically organized according to the new format to get the corresponding model output. - -## Evaluation Result on Each Dataset - -The following tables display the Zero-shot evaluation results (%) of the model on all subtask datasets, where Chinese datasets are denoted by "zh" (please refer to this [link](task_en.md) for the evaluation metrics used for each task). 
- -#### Single-Document QA -| | NarrativeQA | Qasper | MultiFieldQA-en | MultiFieldQA-zh | -| ----------------- | :---------: | :----: | :-------------: | :-------------: | -| GPT-3.5-Turbo-16k | 23.6 | 43.3 | 52.3 | 61.2 | -| Llama2-7B-chat-4k | 19.1 | 19.6 | 35.8 | 11.6 | -| LongChat-7B-16k | 21.6 | 21.6 | 44.6 | 26.6 | -| XGen-7B-8k | 17.9 | 18.3 | 37.2 | 14.2 | -| InternLM-7B-8k | 12.4 | 16.8 | 22.3 | 33.3 | -| ChatGLM2-6B | 11.2 | 23.7 | 34.2 | 33.0 | -| ChatGLM2-6B-32k | 20.4 | 32.2 | 45.7 | 52.0 | - -#### Multi-Document QA - -| | HotpotQA | 2WikiMQA | Musique | DuReader (zh) | -| ----------------- | :------: | :------: | :-----: | :-----------: | -| GPT-3.5-Turbo-16k | 51.6 | 37.7 | 26.9 | 28.7 | -| Llama2-7B-chat-4k | 24.3 | 31.4 | 8.6 | 1.9 | -| LongChat-7B-16k | 22.4 | 16.8 | 9.1 | 19.1 | -| XGen-7B-8k | 28.3 | 21.5 | 11.5 | 9.1 | -| InternLM-7B-8k | 27.9 | 24.0 | 10.3 | 8.9 | -| ChatGLM2-6B | 20.2 | 19.6 | 5.3 | 15.2 | -| ChatGLM2-6B-32k | 44.9 | 34.9 | 22.2 | 34.3 | - -#### Summarization - -| | GovReport | QMSum | VCSUM (zh) | -| :---------------- | :-------: | :---: | :--------: | -| GPT-3.5-Turbo-16k | 29.5 | 23.4 | 16.0 | -| Llama2-7B-chat-4k | 27.3 | 20.6 | 0.2 | -| LongChat-7B-16k | 28.4 | 23.2 | 14.0 | -| XGen-7B-8k | 27.8 | 21.7 | 1.5 | -| InternLM-7B-8k | 9.8 | 16.8 | 13.0 | -| ChatGLM2-6B | 23.7 | 22.2 | 14.6 | -| ChatGLM2-6B-32k | 33.3 | 23.9 | 16.3 | - -#### Few-shot Learning - -| | TREC | NQ | TriviaQA | LSHT (zh) | -| ----------------- | :--: | :--: | :------: | :-------: | -| GPT-3.5-Turbo-16k | 68.0 | 73.0 | 87.1 | 29.2 | -| Llama2-7B-chat-4k | 60.5 | 31.4 | 59.7 | 19.8 | -| LongChat-7B-16k | 61.5 | 44.8 | 73.5 | 20.8 | -| XGen-7B-8k | 66.0 | 43.2 | 67.0 | 20.0 | -| InternLM-7B-8k | 49.0 | 47.6 | 61.6 | 15.5 | -| ChatGLM2-6B | 44.0 | 34.5 | 59.8 | 20.5 | -| ChatGLM2-6B-32k | 62.0 | 64.9 | 77.6 | 29.9 | - -#### Code Completion - -| | LCC | RepoBench-P | -| ----------------- | :--: | :---------: | -| GPT-3.5-Turbo-16k | 54.7 | 54.3 | -| Llama2-7B-chat-4k | 52.3 | 42.4 | -| LongChat-7B-16k | 59.2 | 54.7 | -| XGen-7B-8k | 38.8 | 37.3 | -| InternLM-7B-8k | 45.5 | 34.0 | -| ChatGLM2-6B | 48.4 | 43.7 | -| ChatGLM2-6B-32k | 55.4 | 50.0 | - -#### Synthetic Tasks - -| | PassageRetrieval-en | Passage Count | PassageRetrieval-zh | -| ----------------- | :-----------------: | :-----------: | :-----------------: | -| GPT-3.5-Turbo-16k | 71.0 | 4.5 | 77.5 | -| Llama2-7B-chat-4k | 9.2 | 2.5 | 0.5 | -| LongChat-7B-16k | 24.0 | 4.5 | 4.8 | -| XGen-7B-8k | 9.0 | 2.2 | 4.2 | -| InternLM-7B-8k | 6.5 | 2.9 | 0.9 | -| ChatGLM2-6B | 3.2 | 2.1 | 5.5 | -| ChatGLM2-6B-32k | 77.5 | 2.0 | 62.5 | - -## Acknowledgements - -- Some of the tasks of **LongBench** are based on the datasets proposed by previous researchers, including [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), [DuReader](https://github.com/baidu/DuReader), [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), [QMSum](https://arxiv.org/pdf/2104.05938.pdf), [VCSUM](https://arxiv.org/abs/2305.05280), [TriviaQA](https://nlp.cs.washington.edu/triviaqa/), [NQ](https://ai.google.com/research/NaturalQuestions/), [TREC](https://aclanthology.org/C02-1150.pdf), [LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf), [LCC](https://arxiv.org/abs/2306.14893) and [RepoBench-P](https://arxiv.org/abs/2306.03091). 
- -## Citation -This work is jointly completed by **THUKEG** and **Zhipu AI**. The related paper is currently being written, and the citation information will be updated when it's ready. Please stay tuned~ - -If you use this benchmark, you can also cite the papers corresponding to the datasets that LongBench is based on. The relevant citation information is listed [here](refs/ref.bib). \ No newline at end of file diff --git a/README_ZH.md b/README_ZH.md new file mode 100644 index 0000000..3eff75c --- /dev/null +++ b/README_ZH.md @@ -0,0 +1,175 @@ + +<p align="center"> + 🤗 <a href="https://huggingface.co/THUDM/chatglm-6b" target="_blank">HF Repo</a> • 📃 Paper coming soon! +</p> + +Read this in [English](README.md). + +# LongBench: 多è¯è¨€ã€å¤šä»»åŠ¡é•¿æ–‡æœ¬ç†è§£è¯„测基准 + +**LongBench**是第一个多è¯è¨€ã€å¤šä»»åŠ¡ã€å…¨é¢è¯„估大è¯è¨€æ¨¡åž‹**长文本ç†è§£**能力的综åˆæ•°æ®é›†ã€‚在目å‰å¤§æ¨¡åž‹å¤šè¯è¨€èƒ½åŠ›å¼•èµ·å¹¿æ³›å…³æ³¨çš„背景下,LongBench涵盖了ä¸åŒçš„è¯è¨€ï¼ˆä¸æ–‡å’Œè‹±æ–‡ï¼‰ï¼Œä»¥æ¤æ¥å¯¹å¤§æ¨¡åž‹åœ¨é•¿æ–‡æœ¬ä¸‹çš„多è¯è¨€èƒ½åŠ›è¿›è¡Œæ›´å…¨é¢çš„评估。åŒæ—¶ï¼ŒLongBench由二å个ä¸åŒçš„任务组æˆï¼Œè¦†ç›–了å•æ–‡æ¡£QAã€å¤šæ–‡æ¡£QAã€æ‘˜è¦ã€Few-shotå¦ä¹ ã€ä»£ç 补全和åˆæˆä»»åŠ¡ç‰å…³é”®çš„长文本应用场景。 + +我们深知模型评测过程ä¸å¯èƒ½äº§ç”Ÿçš„高昂æˆæœ¬ï¼Œå°¤å…¶æ˜¯é•¿æ–‡æœ¬åœºæ™¯ä¸‹ï¼ˆå¦‚äººå·¥æ ‡æ³¨æˆæœ¬æˆ–API调用æˆæœ¬ï¼‰ã€‚å› æ¤ï¼Œæˆ‘们采用了一ç§å…¨è‡ªåŠ¨çš„评测方å¼ï¼Œæ—¨åœ¨ä»¥æœ€ä½Žçš„æˆæœ¬ï¼Œæœ€æœ‰æ•ˆåœ°è¡¡é‡å’Œè¯„估模型的长文本ç†è§£èƒ½åŠ›ã€‚ + +LongBench包å«13个英文任务ã€5个ä¸æ–‡ä»»åŠ¡å’Œ2个代ç 任务,多数任务的平å‡é•¿åº¦åœ¨5k-15k之间,共包å«çº¦4500æ¡æµ‹è¯•æ•°æ®ã€‚从主è¦ä»»åŠ¡åˆ†ç±»ä¸Šï¼ŒLongBench包å«å…大类任务,å³å•æ–‡æ¡£QAã€å¤šæ–‡æ¡£QAã€æ‘˜è¦ã€Few-shotå¦ä¹ ã€ä»£ç 补全和åˆæˆä»»åŠ¡ã€‚关于LongBenchæ•°æ®é›†çš„具体统计åŠä»»åŠ¡æž„é€ æ–¹å¼è¯·å‚考[这里](task_zh.md)。 + +| 任务类型 | 英文任务数 | ä¸æ–‡ä»»åŠ¡æ•° | 代ç 任务数 | +| :----------: | :--------: | :--------: | :--------: | +| å•æ–‡æ¡£QA | 3 | 1 | - | +| 多文档QA | 3 | 1 | - | +| æ‘˜è¦ | 2 | 1 | - | +| Few-shotå¦ä¹ | 3 | 1 | - | +| åˆæˆä»»åŠ¡ | 2 | 1 | - | +| 代ç 补全 | - | - | 2 | + +## 排行榜 +我们在这里展示了所有模型在Zero-shot场景下,在ä¸æ–‡å’Œè‹±æ–‡å„大类任务上得分的平å‡å€¼ï¼ˆ%),å„ä»»åŠ¡è¯„ä¼°æ‰€ç”¨æŒ‡æ ‡è¯·å‚考[这里](task_zh.md)。 +#### è‹±æ–‡æ¦œå• +| | Avg | å•æ–‡æ¡£QA | 多文档QA | æ‘˜è¦ | Few-shotå¦ä¹ | 代ç 补全 | åˆæˆä»»åŠ¡ | +| --- | :-: | :-: | :-: | :-: | :-: | :-: | :-: | +| GPT-3.5-Turbo-16k | 45.5 | 39.8 | 38.7 | 26.5 | 76.0 | 54.5 | 37.8 | +| Llama2-7B-chat-4k | 29.0 | 24.8 | 21.4 | 23.9 | 50.5 | 47.3 | 5.9 | +| LongChat-7B-16k | 33.7 | 29.3 | 16.1 | 25.8 | 59.9 | 57.0 | 14.2 | +| XGen-7B-8k | 28.7 | 24.5 | 20.4 | 24.8 | 58.7 | 38.0 | 5.6 | +| InternLM-7B-8k | 24.7 | 17.1 | 20.8 | 13.3 | 52.7 | 39.7 | 4.7 | +| ChatGLM2-6B | 26.0 | 23.1 | 15.0 | 22.9 | 46.1 | 46.1 | 2.7 | +| ChatGLM2-6B-32k | 42.7 | 32.8 | 34.0 | 28.6 | 68.1 | 52.7 | 39.8 | + +#### ä¸æ–‡æ¦œå• +| | Avg | å•æ–‡æ¡£QA | 多文档QA | æ‘˜è¦ | Few-shotå¦ä¹ | 代ç 补全 | åˆæˆä»»åŠ¡ | +|-------|:---:|:-------------:|:------------:|:-------------:|:-----------------:|:---------------:|:----------------:| +| GPT-3.5-Turbo-16k | 44.5 | 61.2 | 28.7 | 16.0 | 29.2 | 54.5 | 77.5 | +| Llama2-7B-chat-4k | 13.5 | 11.6 | 1.9 | 0.2 | 19.8 | 47.3 | 0.5 | +| LongChat-7B-16k | 23.7 | 26.6 | 19.1 | 14.0 | 20.8 | 57.0 | 4.8 | +| XGen-7B-8k | 14.5 | 14.2 | 9.1 | 1.5 | 20.0 | 38.0 | 4.2 | +| InternLM-7B-8k | 18.6 | 33.3 | 8.9 | 13.0 | 15.5 | 39.7 | 0.9 | +| ChatGLM2-6B | 22.5 | 33.0 | 15.2 | 14.6 | 20.5 | 46.1 | 5.5 | +| ChatGLM2-6B-32k | 41.3 | 52.0 | 34.3 | 16.3 | 29.9 | 52.7 | 62.5 | + +#### 长文本任务能力雷达图 + + +#### ä¸åŒé•¿åº¦æ–‡æœ¬ä¸‹çš„能力å˜åŒ– +为了更有针对性地分æžæ¨¡åž‹åœ¨ä¸åŒæ–‡æœ¬é•¿åº¦ä¸‹çš„相对表现,下图展示了模型在ä¸åŒæ–‡æœ¬é•¿åº¦åŒºé—´ä¸Šï¼Œæ‰€æœ‰ä»»åŠ¡ä¸Šçš„å¹³å‡ç›¸å¯¹åˆ†æ•°ã€‚ + + +> 
注:å‡è®¾æ¨¡åž‹åœ¨æŸä¸ªä»»åŠ¡çš„特定长度范围内数æ®ä¸Šå¾—分为x,在该任务所有数æ®ä¸Šå¾—分为y,则模型在该长度范围的**相对分数**为(x/y-1)。为了更好比较ä¸åŒæ¨¡åž‹çš„å˜åŒ–趋势,我们在0-4k将所有折线平移至0。 + +## 如何在LongBench上评测模型 + +#### è½½å…¥æ•°æ® +ä½ å¯ä»¥é€šè¿‡Hugging Face datasetsæ¥ä¸‹è½½å¹¶è½½å…¥**LongBench**çš„æ•°æ®ï¼ˆ[🤗 HF Repo](https://huggingface.co/datasets/THUDM/LongBench)): +```python +from datasets import load_dataset + +datasets = ["hotpotqa", "2wikimqa", "musique", "dureader", "narrativeqa", "qasper", "multifieldqa_en", \ + "multifieldqa_zh", "gov_report", "qmsum", "vcsum", "trec", "nq", "triviaqa", "lsht", "passage_count", \ + "passage_retrieval_en", "passage_retrieval_zh", "lcc", "repobench-p"] + +for dataset in datasets: + data = load_dataset('THUDM/LongBench', dataset, split='test') +``` +åŒæ ·åœ°ï¼Œä½ 也å¯ä»¥ç›´æŽ¥ç”¨è¿™ä¸ª[链接](https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip)下载所有的评测数æ®ã€‚ + +#### æ•°æ®æ ¼å¼ +**LongBench**ä¸æ‰€æœ‰æ•°æ®éƒ½ç»Ÿä¸€ä¸ºä»¥ä¸‹æ ¼å¼ï¼š +```json +{ + "input": "任务的输入/指令,通常较çŸï¼Œæ¯”如QAä¸çš„问题ã€Few-shot任务ä¸çš„æé—®ç‰", + "context": "任务所需的长è¯å¢ƒæ–‡æœ¬ï¼Œæ¯”如文档ã€è·¨æ–‡ä»¶ä»£ç ã€Few-shot任务ä¸çš„few-shotæ ·æœ¬", + "answers": "ç”±æ‰€æœ‰æ ‡å‡†ç”案组æˆçš„列表", + "length": "å‰ä¸‰é¡¹æ–‡æœ¬çš„总长度(ä¸ã€è‹±æ–‡åˆ†åˆ«ç”¨å—ã€è¯æ•°ç»Ÿè®¡ï¼‰", + "dataset": "本æ¡æ•°æ®æ‰€å±žæ•°æ®é›†å称", + "language": "本æ¡æ•°æ®çš„è¯è¨€", + "all_classes": "分类任务ä¸çš„所有类别,éžåˆ†ç±»ä»»åŠ¡åˆ™ä¸ºnull", + "_id": "æ¯æ¡æ•°æ®çš„éšæœºid" +} +``` + +#### 评测 +我们以ChatGLM2-6B为例æ供了一份评测代ç 。首先,è¿è¡Œä»“库下的[pred.py](pred.py) +```bash +CUDA_VISIBLE_DEVICES=0 python pred.py +``` +å¯ä»¥åœ¨`pred/`文件夹下得到模型在所有数æ®é›†ä¸‹çš„输出,æ¤åŽè¿è¡Œ[eval.py](eval.py)的评测代ç : +```bash +python eval.py +``` +å¯ä»¥åœ¨`result.json`ä¸å¾—到在å„æ•°æ®é›†ä¸Šçš„评测结果。请注æ„,我们在`config/`下æ供了我们总结出æ¥çš„在å„æ•°æ®é›†ä¸Šé€‚åˆçš„è¾“å…¥æ ¼å¼å’Œæœ€å¤§è¾“出长度é™åˆ¶ï¼Œåœ¨è¯„测的时候å¯ä»¥è¿›è¡Œä¿®æ”¹ä»¥æ›´å¥½åœ°é€‚ç”¨ä½ è¦è¯„测的模型,修改åŽåœ¨[pred.py](pred.py)è¯„æµ‹æ—¶ä¼šè‡ªåŠ¨æŒ‰ç…§æ–°çš„æ ¼å¼åŽ»æ•´ç†æ•°æ®å¹¶å¾—到对应的模型输出。 + +## 详细评测结果 +下é¢çš„å‡ å¼ è¡¨æ ¼å±•ç¤ºäº†æ¨¡åž‹åœ¨æ‰€æœ‰å任务数æ®é›†ä¸Šçš„Zero-shot评测结果(%),其ä¸çš„ä¸æ–‡æ•°æ®é›†ä»¥â€œzhâ€æ ‡ç¤ºï¼ˆå„ä»»åŠ¡è¯„ä¼°æ‰€ç”¨æŒ‡æ ‡è¯·å‚考[这里](task_zh.md))。 +> 注:对于超出模型处ç†é•¿åº¦èƒ½åŠ›çš„文本,å‚考[Lost in the Middle](https://arxiv.org/abs/2307.03172)的观察,我们从文本ä¸é—´è¿›è¡Œæˆªæ–,ä¿æŒå‰åŽéƒ¨åˆ†çš„ä¿¡æ¯ã€‚实验表明,这ç§æˆªæ–æ–¹å¼å¯¹æ¨¡åž‹æ€§èƒ½å½±å“最å°ã€‚ +#### å•æ–‡æ¡£QA +| | NarrativeQA | Qasper | MultiFieldQA-en | MultiFieldQA-zh | +|-------------------|:-----------:|:------:|:---------------:|:---------------:| +| GPT-3.5-Turbo-16k | 23.6 | 43.3 | 52.3 | 61.2 | +| Llama2-7B-chat-4k | 19.1 | 19.6 | 35.8 | 11.6 | +| LongChat-7B-16k | 21.6 | 21.6 | 44.6 | 26.6 | +| XGen-7B-8k | 17.9 | 18.3 | 37.2 | 14.2 | +| InternLM-7B-8k | 12.4 | 16.8 | 22.3 | 33.3 | +| ChatGLM2-6B | 11.2 | 23.7 | 34.2 | 33.0 | +| ChatGLM2-6B-32k | 20.4 | 32.2 | 45.7 | 52.0 | + +#### 多文档QA +| | HotpotQA | 2WikiMQA | Musique | DuReader (zh) | +|----------------------|:--------:|:--------:|:-------:|:--------:| +| GPT-3.5-Turbo-16k | 51.6 | 37.7 | 26.9 | 28.7 | +| Llama2-7B-chat-4k | 24.3 | 31.4 | 8.6 | 1.9 | +| LongChat-7B-16k | 22.4 | 16.8 | 9.1 | 19.1 | +| XGen-7B-8k | 28.3 | 21.5 | 11.5 | 9.1 | +| InternLM-7B-8k | 27.9 | 24.0 | 10.3 | 8.9 | +| ChatGLM2-6B | 20.2 | 19.6 | 5.3 | 15.2 | +| ChatGLM2-6B-32k | 44.9 | 34.9 | 22.2 | 34.3 | + +#### æ‘˜è¦ +| | GovReport | QMSum | VCSUM (zh) | +|:-----------|:---------:|:-----:|:-----:| +| GPT-3.5-Turbo-16k | 29.5 | 23.4 | 16.0 | +| Llama2-7B-chat-4k | 27.3 | 20.6 | 0.2 | +| LongChat-7B-16k | 28.4 | 23.2 | 14.0 | +| XGen-7B-8k | 27.8 | 21.7 | 1.5 | +| InternLM-7B-8k | 9.8 | 16.8 | 
13.0 | +| ChatGLM2-6B | 23.7 | 22.2 | 14.6 | +| ChatGLM2-6B-32k | 33.3 | 23.9 | 16.3 | + +#### Few-shotå¦ä¹ +| | TREC | NQ | TriviaQA | LSHT (zh) | +| --- | :-: | :-: | :-: | :-: | +| GPT-3.5-Turbo-16k | 68.0 | 73.0 | 87.1 | 29.2 | +| Llama2-7B-chat-4k | 60.5 | 31.4 | 59.7 | 19.8 | +| LongChat-7B-16k | 61.5 | 44.8 | 73.5 | 20.8 | +| XGen-7B-8k | 66.0 | 43.2 | 67.0 | 20.0 | +| InternLM-7B-8k | 49.0 | 47.6 | 61.6 | 15.5 | +| ChatGLM2-6B | 44.0 | 34.5 | 59.8 | 20.5 | +| ChatGLM2-6B-32k | 62.0 | 64.9 | 77.6 | 29.9 | + +#### 代ç 补全 +| | LCC | RepoBench-P | +| --- | :-: | :-: | +| GPT-3.5-Turbo-16k | 54.7 | 54.3 | +| Llama2-7B-chat-4k | 52.3 | 42.4 | +| LongChat-7B-16k | 59.2 | 54.7 | +| XGen-7B-8k | 38.8 | 37.3 | +| InternLM-7B-8k | 45.5 | 34.0 | +| ChatGLM2-6B | 48.4 | 43.7 | +| ChatGLM2-6B-32k | 55.4 | 50.0 | + +#### åˆæˆä»»åŠ¡ +| | PassageRetrieval-en | Passage Count | PassageRetrieval-zh | +| --- | :-: | :-: | :-: | +| GPT-3.5-Turbo-16k | 71.0 | 4.5 | 77.5 | +| Llama2-7B-chat-4k | 9.2 | 2.5 | 0.5 | +| LongChat-7B-16k | 24.0 | 4.5 | 4.8 | +| XGen-7B-8k | 9.0 | 2.2 | 4.2 | +| InternLM-7B-8k | 6.5 | 2.9 | 0.9 | +| ChatGLM2-6B | 3.2 | 2.1 | 5.5 | +| ChatGLM2-6B-32k | 77.5 | 2.0 | 62.5 | + +## 致谢 +- **LongBench**的部分任务基于之å‰çš„ç ”ç©¶è€…æ出的数æ®é›†æž„建,包括[HotpotQA](https://hotpotqa.github.io/),[2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/),[Musique](https://arxiv.org/abs/2108.00573),[DuReader](https://github.com/baidu/DuReader),[NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf),[Qasper](https://arxiv.org/pdf/2105.03011.pdf),[GovReport](https://arxiv.org/pdf/2104.02112.pdf),[QMSum](https://arxiv.org/pdf/2104.05938.pdf),[VCSUM](https://arxiv.org/abs/2305.05280),[TriviaQA](https://nlp.cs.washington.edu/triviaqa/),[NQ](https://ai.google.com/research/NaturalQuestions/),[TREC](https://aclanthology.org/C02-1150.pdf),[LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf),[LCC](https://arxiv.org/abs/2306.14893)å’Œ[RepoBench-P](https://arxiv.org/abs/2306.03091)。 + +## 引用 +本工作由**THUKEG**å’Œ**Zhipu AI**å…±åŒå®Œæˆï¼Œç›¸å…³è®ºæ–‡æ£åœ¨æ’°å†™ä¸ï¼Œå±Šæ—¶å°†æ›´æ–°å¼•ç”¨ä¿¡æ¯ï¼Œæ•¬è¯·å…³æ³¨~ + +如果您使用该Benchmark,å¯ä»¥ä¸€å¹¶å¼•ç”¨LongBench所基于的数æ®é›†å¯¹åº”的论文,相关引用信æ¯åˆ—举在[这里](refs/ref.bib)。 diff --git a/task.md b/task.md index 67a6a0d..186164f 100644 --- a/task.md +++ b/task.md @@ -1,66 +1,67 @@ -# 任务统计 +# Task statistics -| 任务 | 任务类型 | è¯„ä»·æŒ‡æ ‡ | å¹³å‡é•¿åº¦ |è¯è¨€ | Sampleæ•°é‡| -| :--------- | :-----------:| :-----------: |:---------: | :-------------: |:---------: | -| HotpotQA | 多文档QA | F1 |9,149 |英文 |200 | -| 2WikiMultihopQA| 多文档QA | F1 |4,885 |英文 |200 | -| Musique| 多文档QA | F1 |7,798 |英文 |200 | -| DuReader| 多文档QA | Rouge-L |15,768 |ä¸æ–‡ |200 | -| MultiFieldQA-en| å•æ–‡æ¡£QA | F1 |4,559 |英文 |150 | -| MultiFieldQA-zh| å•æ–‡æ¡£QA | F1 |6,771 |ä¸æ–‡ |200 | -| NarrativeQA| å•æ–‡æ¡£QA | F1 |18,405 |英文 |200 | -| Qasper| å•æ–‡æ¡£QA | F1 |3,619 |英文 |200 | -| GovReport| æ‘˜è¦ | Rouge-L |8,169 |英文 |200 | -| QMSum| æ‘˜è¦ | Rouge-L |10,546 |英文 |200 | -| VCSUM| æ‘˜è¦ | Rouge-L |15,147 |ä¸æ–‡ |200 | -| TriviaQA| Few shot | F1 |8,015 |英文 |200 | -| NQ| Few shot | F1 |8,210 |英文 |200 | -| TREC| Few shot | Accuracy |5,176 |英文 |200 | -| LSHT| Few shot | Accuracy |22,333 |ä¸æ–‡ |200 | -| PassageRetrieval-en| åˆæˆä»»åŠ¡ | Accuracy |9,288 |英文 |200 | -| PassageCount| åˆæˆä»»åŠ¡ | Accuracy |11,141 |英文 |200 | -| PassageRetrieval-zh | åˆæˆä»»åŠ¡ | Accuracy |6,745 |ä¸æ–‡ |200 | -| LCC| 代ç | Edit Sim |1,235 |Python/C#/Java |500 | -| RepoBench-P| 代ç | Edit Sim |5,622 |Python/Java |500 | +| Task | Task Type | Eval 
metric | Avg len |Language | \#Sample | +| --------- | -------------| ------------- |--------- | ------------- |--------- | +| HotpotQA | Multi-doc QA | F1 |9149 |EN |200 | +| 2WikiMultihopQA| Multi-doc QA | F1 |4885 |EN |200 | +| Musique| Multi-doc QA | F1 |7798 |EN |200 | +| DuReader| Multi-doc QA | Rouge-L |15768 |ZH |200 | +| MultiFieldQA-en| Single-doc QA | F1 |4559 |EN |150 | +| MultiFieldQA-zh| Single-doc QA | F1 |6771 |ZH |200 | +| NarrativeQA| Single-doc QA | F1 |18405 |EN |200 | +| Qasper| Single-doc QA | F1 |3619 |EN |200 | +| GovReport| Summarization | Rouge-L |8169 |EN |200 | +| QMSum| Summarization | Rouge-L |10546 |EN |200 | +| VCSUM| Summarization | Rouge-L |15147 |ZH |200 | +| TriviaQA| Few shot | F1 |8015 |EN |200 | +| NQ| Few shot | F1 |8210 |EN |200 | +| TREC| Few shot | Accuracy |5176 |EN |200 | +| LSHT| Few shot | Accuracy |22333 |ZH |200 | +| PassageRetrieval-en| Synthetic | Accuracy |9288 |EN |200 | +| PassageCount| Synthetic | Accuracy |11141 |EN |200 | +| PassageRetrieval-zh | Synthetic | Accuracy |6745 |ZH |200 | +| LCC| Code | Edit Sim |1235 |Python/C#/Java |500 | +| RepoBench-P| Code | Edit Sim |5622 |Python/Java |500 | -> 注:为了é¿å…ä¸åŒTokenizer统计的差è·ï¼Œæˆ‘们使用å•è¯æ•°ï¼ˆPythonçš„split函数)æ¥ç»Ÿè®¡è‹±æ–‡æ•°æ®é›†å’Œä»£ç æ•°æ®é›†çš„å¹³å‡é•¿åº¦ï¼Œä½¿ç”¨æ±‰å—æ•°æ¥ç»Ÿè®¡ä¸æ–‡æ•°æ®é›†çš„å¹³å‡é•¿åº¦ã€‚ +> Note: In order to avoid discrepancies caused by different tokenizers, we use the word count (using Python's split function) to calculate the average length of English datasets and code datasets, and use the character count to calculate the average length of Chinese datasets. -# 任务说明 +# Task description -| 任务 | 任务说明 | -| :----------------- | :----------------------------------------------------------- | -| HotpotQA | 基于多篇给定的文档,回ç”相关问题 | -| 2WikiMultihopQA | 基于多篇给定的文档,回ç”相关问题 | -| Musique | 基于多篇给定的文档,回ç”相关问题 | -| DuReader | 基于多篇给定的检索文档,回ç”相关的ä¸æ–‡é—®é¢˜ | -| MultiFieldQA-en | 基于å•ç¯‡æ–‡æ¡£ï¼Œå›žç”英文问题,文档所属的领域相对多元 | -| MultiFieldQA-zh | 基于å•ç¯‡æ–‡æ¡£ï¼Œå›žç”ä¸æ–‡é—®é¢˜ï¼Œæ–‡æ¡£æ‰€å±žçš„领域相对多元 | -| NarrativeQA | 基于故事或剧本æ问,包括对人物ã€æƒ…节ã€ä¸»é¢˜ç‰é‡è¦å…ƒç´ çš„ç†è§£ | -| Qasper | 基于å•ç¯‡è®ºæ–‡çš„æ出,问题由NLP的读者æ出,并由NLPä»Žä¸šè€…å›žç” | -| GovReport | 摘è¦ä»»åŠ¡ï¼Œè¦æ±‚å¯¹æ”¿åºœçš„å·¥ä½œæŠ¥å‘Šè¿›è¡Œæ€»ç»“æ‘˜è¦ | -| QMSum | 摘è¦ä»»åŠ¡ï¼Œè¦æ±‚åŸºäºŽç”¨æˆ·çš„æŸ¥è¯¢å¯¹ä¼šè®®è®°å½•è¿›è¡Œæ‘˜è¦ | -| VCSUM | 摘è¦ä»»åŠ¡ï¼Œè¦æ±‚对ä¸æ–‡ä¼šè®®è®°å½•è¿›è¡Œæ€»ç»“æ‘˜è¦ | -| TriviaQA | å•æ–‡æ¡£é—®ç”任务,æ供若干的Few Shotæ ·ä¾‹ | -| NQ | å•æ–‡æ¡£é—®ç”任务,æ供若干的Few Shotæ ·ä¾‹ | -| TREC | 分类任务,è¦æ±‚对问题进行分类,一共包å«50个类别 | -| LSHT | ä¸æ–‡åˆ†ç±»ä»»åŠ¡ï¼Œè¦æ±‚对新闻进行分类,一共包å«24个类别 | -| PassageRetrieval-en | 给定30个英文维基的段è½ï¼Œåˆ¤æ–给定的摘è¦å±žäºŽå“ªä¸ªæ®µè½ | -| PassageCount | 判æ–给定的若干的段è½ä¸ä¸é‡å¤çš„段è½ä¸€å…±æœ‰å‡ 个 | -| PassageRetrieval-zh | 给定若干个出自C4æ•°æ®é›†çš„ä¸æ–‡æ®µè½ï¼Œåˆ¤æ–给定的摘è¦å±žäºŽå“ªä¸ªæ®µè½ | -| LCC | 给定一段较长代ç ,è¦æ±‚预测出下一行代ç | -| RepoBench-P | 给定一个github仓库内多个文件ä¸çš„代ç (包å«æ–‡ä»¶é—´ä¾èµ–),è¦æ±‚预测出下一行代ç | +| Task | Task Description | +| ----------------- | ------------------------------------------------------------ | +| HotpotQA | Answer related questions based on multiple given documents | +| 2WikiMultihopQA | Answer related questions based on multiple given documents | +| Musique | Answer related questions based on multiple given documents | +| DuReader | Answer related Chinese questions based on multiple retrieved documents | +| MultiFieldQA-en | Answer English questions based on a single document, which comes from a relatively diverse field | +| MultiFieldQA-zh | Answer Chinese questions based on a single document, 
which comes from a relatively diverse field | +| NarrativeQA | Ask questions based on stories or scripts, including understanding of important elements such as characters, plots, themes, etc. | +| Qasper | Ask questions based on a single paper, questions proposed by NLP readers, and answered by NLP practitioners | +| GovReport | A summarization task that requires summarizing government work reports | +| QMSum | A summarization task that requires summarizing meeting records based on user queries | +| VCSUM | A summarization task that requires summarizing Chinese meeting records | +| TriviaQA | Single document question answering task, providing several Few Shot examples | +| NQ | Single document question answering task, providing several Few Shot examples | +| TREC | A classification task that requires categorizing questions, includes 50 categories in total | +| LSHT | A Chinese classification task that requires categorizing news, includes 24 categories in total | +| PassageRetrieval-en | Given 30 English Wikipedia paragraphs, determine which paragraph the given summary belongs to | +| PassageCount | Determine the number of non-repeating paragraphs in a given number of paragraphs | +| PassageRetrieval-zh | Given several Chinese paragraphs from the C4 data set, determine which paragraph the given abstract belongs to | +| LCC | Given a longer piece of code, predict the next line of code | +| RepoBench-P | Given code in multiple files within a GitHub repository (including inter-file dependencies), predict the next line of code | -# æ•°æ®æž„é€ æ–¹å¼ -> 注:对于所有基于已有数æ®é›†æž„é€ çš„ä»»åŠ¡ï¼Œæˆ‘ä»¬å‡é€‰ç”¨åŽŸæœ‰æ•°æ®é›†çš„验è¯é›†æˆ–测试集的数æ®ï¼ˆVCSUM任务除外) +# Task construction -- [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573)å’Œ[DuReader](https://github.com/baidu/DuReader)任务基于原始的数æ®é›†æž„建,并进行相关处ç†ä½¿å…¶é€‚用于长文本评测。具体地,对于验è¯é›†ä¸çš„问题,我们会选å–包å«ç”案的evidence passageå’Œè‹¥å¹²å¹²æ‰°çš„æ–‡ç« ï¼Œè¿™äº›æ–‡ç« å’ŒåŽŸå§‹çš„é—®é¢˜å…±åŒç»„æˆäº†ç›¸å…³ä»»åŠ¡çš„输入。 -- MultiFiedQA-zhå’ŒMultiFieldQA-en任务由约10ç§æ¥æºçš„长文本数æ®ç»„æˆï¼ŒåŒ…å«Latex论文ã€è£åˆ¤æ–‡ä¹¦ã€æ”¿åºœå·¥ä½œæŠ¥å‘Šå’Œè°·æŒç´¢å¼•çš„PDF文档ç‰ã€‚对于æ¯ç¯‡é•¿æ–‡æœ¬ï¼Œæˆ‘们邀请了若干åšå£«ç”Ÿå’Œç¡•å£«ç”Ÿæ¥è¿›è¡Œæ ‡æ³¨ï¼Œå³åŸºäºŽé•¿æ–‡æœ¬æ问,并给出æ£ç¡®çš„ç”案。为了更好地进行自动化评测,我们è¦æ±‚æ ‡æ³¨å‘˜å°½å¯èƒ½æ出有确定性ç”案的问题。 -- [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf)å’Œ[QMSum](https://arxiv.org/pdf/2104.05938.pdf)任务直接使用原论文æ供的数æ®ã€‚在具体的构建ä¸ï¼Œæˆ‘们使用[ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/)æ供的模æ¿æ¥å°†å¯¹åº”çš„æ•°æ®è½¬æ¢ä¸ºçº¯æ–‡æœ¬çš„输入。 -- [VCSUM](https://arxiv.org/abs/2305.05280)任务基于原始的数æ®é›†æž„建,我们针对该数æ®è®¾è®¡äº†ç›¸åº”的模æ¿å°†å¯¹åº”çš„æ•°æ®è½¬æ¢ä¸ºçº¯æ–‡æœ¬çš„输入。 -- [TriviaQA](https://nlp.cs.washington.edu/triviaqa/)å’Œ[NQ](https://ai.google.com/research/NaturalQuestions/)任务å‚考[CoLT5](https://arxiv.org/abs/2303.09752)çš„æ–¹å¼è¿›è¡Œæž„建,å³ä¼šæ供若干基于文档进行问ç”çš„æ ·ä¾‹ï¼Œå¹¶è¦æ±‚è¯è¨€æ¨¡åž‹åŸºäºŽæ–°çš„文档回ç”相关问题。 -- [TREC](https://aclanthology.org/C02-1150.pdf)å’Œ[LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf)任务基于原始的数æ®é›†æž„建。对于验è¯é›†ä¸çš„æ¯ä¸ªé—®é¢˜ï¼Œæˆ‘ä»¬é‡‡æ ·è®ç»ƒé›†ä¸çš„若干数æ®ç»„æˆFew-shotæ ·ä¾‹ã€‚è¿™äº›æ ·ä¾‹ä¼šå’ŒéªŒè¯é›†ä¸çš„问题共åŒç»„æˆè¯¥ä»»åŠ¡çš„输入。 -- PassageRetrieval-enä»»åŠ¡åŸºäºŽè‹±æ–‡ç»´åŸºè¿›è¡Œæž„é€ ã€‚å¯¹äºŽæ¯æ¡æ•°æ®ï¼Œæˆ‘们éšæœºé‡‡æ ·30段英文维基的段è½ï¼Œå¹¶é€‰å–å…¶ä¸ä¸€æ®µè¿›è¡Œæ‘˜è¦ï¼ˆä½¿ç”¨GPT-3.5-Turbo)。该任务è¦æ±‚模型给出摘è¦åº”该对应哪个的原始段è½ã€‚ -- 
PassageCountä»»åŠ¡åŸºäºŽè‹±æ–‡ç»´åŸºè¿›è¡Œæž„é€ ã€‚å¯¹äºŽæ¯æ¡æ•°æ®ï¼Œæˆ‘们éšæœºé‡‡æ ·è‹¥å¹²è‹±æ–‡ç»´åŸºçš„段è½ï¼Œå¹¶å°†å…¶ä¸çš„æ¯ä¸ªæ®µè½éšæœºé‡å¤è‹¥å¹²æ¬¡ï¼Œæœ€åŽå°†æ®µè½éšæœºæ‰“乱。该任务è¦æ±‚模型判æ–给定的若干的段è½ä¸ä¸é‡å¤çš„段è½ä¸€å…±æœ‰å‡ 个。 -- PassageRetrieval-zh任务基于[C4](https://arxiv.org/abs/1910.10683)è¿›è¡Œæž„é€ ã€‚å¯¹äºŽæ¯æ¡æ•°æ®ï¼Œæˆ‘们éšæœºé‡‡æ ·è‹¥å¹²æ®µæ¥è‡ªäºŽC4çš„ä¸æ–‡æ®µè½ï¼Œå¹¶é€‰å–å…¶ä¸ä¸€æ®µè¿›è¡Œæ‘˜è¦ï¼ˆä½¿ç”¨GPT-3.5-Turbo)。该任务è¦æ±‚模型给出摘è¦åº”该对应哪个的原始段è½ã€‚ -- [LCC](https://arxiv.org/abs/2306.14893)任务我们基于原始的代ç 补全数æ®é›†é‡‡æ ·æž„建。[RepoBench-P](https://arxiv.org/abs/2306.03091)任务ä¸æˆ‘们选å–了原数æ®é›†æœ€å…·æŒ‘战性的XF-F(Cross-File-First)设定,并且å‚考原文ä¸çš„Oracle-Filled场景,对于æ¯ä¸€æ¡åŽŸå§‹æ•°æ®æˆ‘们éšæœºæŠ½å–包括有效跨文件代ç 片段(gold snippet)在内的多个跨文件代ç 片段,将其拼接åŽä½œä¸ºè¾“入,è¦æ±‚模型从其ä¸åˆ©ç”¨æœ‰æ•ˆçš„跨文件代ç 以补全当å‰æ–‡ä»¶ä¸çš„代ç 。 \ No newline at end of file +> Note: For all tasks constructed from existing datasets, we use data from the validation or test set of the existing dataset (except for VCSUM). + +- The tasks of [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) are built based on the original datasets and processed to make them suitable for long text evaluation. Specifically, for questions in the validation set, we select the evidence passage that contains the answer and several distracting articles. These articles together with the original question constitute the input of the related tasks. +- The tasks of MultiFiedQA-zh and MultiFieldQA-en consist of long-text data from about 10 sources, including Latex papers, judicial documents, government work reports, and PDF documents indexed by Google. For each long text, we invite several PhD and master students to annotate, i.e., to ask questions based on the long text and give the correct answers. To better automate evaluation, we require the annotators to ask questions with definitive answers as much as possible. +- The tasks of [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), and [QMSum](https://arxiv.org/pdf/2104.05938.pdf) directly use the data provided by the original papers. In the specific construction, we use the template provided by [ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/) to convert the corresponding data into pure text input. +- The [VCSUM](https://arxiv.org/abs/2305.05280) task is built based on the original dataset, and we have designed a corresponding template to convert the corresponding data into pure text input. +- The tasks of [TriviaQA](https://nlp.cs.washington.edu/triviaqa/) and [NQ](https://ai.google.com/research/NaturalQuestions/) are constructed in the manner of [CoLT5](https://arxiv.org/abs/2303.09752), which provides several examples of question and answering based on documents, and requires the language model to answer related questions based on new documents. +- The tasks of [TREC](https://aclanthology.org/C02-1150.pdf) and [LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf) are built based on the original datasets. For each question in the validation set, we sample several data from the training set to form few-shot examples. These examples together with the questions in the validation set constitute the input for this task. +- The PassageRetrieval-en task is constructed based on English Wikipedia. 
For each piece of data, we randomly sample 30 paragraphs from English Wikipedia and select one of them for summarization (using GPT-3.5-Turbo). The task requires the model to specify which original paragraph the summary corresponds to.
+- The PassageCount task is constructed based on English Wikipedia. For each piece of data, we randomly sample several passages from English Wikipedia, repeat each paragraph a random number of times, and finally shuffle the paragraphs. This task requires the model to determine how many unique paragraphs there are among the given paragraphs.
+- The PassageRetrieval-zh task is constructed based on [C4](https://arxiv.org/abs/1910.10683). For each piece of data, we randomly sample several Chinese paragraphs from C4 and select one of them for summarization (using GPT-3.5-Turbo). This task requires the model to identify which original paragraph the summary corresponds to.
+- For the [LCC](https://arxiv.org/abs/2306.14893) task, we sample from the original code completion dataset. In the [RepoBench-P](https://arxiv.org/abs/2306.03091) task, we select the most challenging XF-F (Cross-File-First) setting from the original dataset and follow the Oracle-Filled scenario in the original paper. For each original piece of data, we randomly extract multiple cross-file code snippets, including the relevant cross-file code snippet (the gold snippet), and concatenate them as input, requiring the model to use the relevant cross-file code to complete the code in the current file.
\ No newline at end of file
diff --git a/task_en.md b/task_en.md
deleted file mode 100644
index 186164f..0000000
--- a/task_en.md
+++ /dev/null
@@ -1,67 +0,0 @@
-# Task statistics
-
-| Task | Task Type | Eval metric | Avg len |Language | \#Sample |
-| --------- | -------------| ------------- |--------- | ------------- |--------- |
-| HotpotQA | Multi-doc QA | F1 |9149 |EN |200 |
-| 2WikiMultihopQA| Multi-doc QA | F1 |4885 |EN |200 |
-| Musique| Multi-doc QA | F1 |7798 |EN |200 |
-| DuReader| Multi-doc QA | Rouge-L |15768 |ZH |200 |
-| MultiFieldQA-en| Single-doc QA | F1 |4559 |EN |150 |
-| MultiFieldQA-zh| Single-doc QA | F1 |6771 |ZH |200 |
-| NarrativeQA| Single-doc QA | F1 |18405 |EN |200 |
-| Qasper| Single-doc QA | F1 |3619 |EN |200 |
-| GovReport| Summarization | Rouge-L |8169 |EN |200 |
-| QMSum| Summarization | Rouge-L |10546 |EN |200 |
-| VCSUM| Summarization | Rouge-L |15147 |ZH |200 |
-| TriviaQA| Few shot | F1 |8015 |EN |200 |
-| NQ| Few shot | F1 |8210 |EN |200 |
-| TREC| Few shot | Accuracy |5176 |EN |200 |
-| LSHT| Few shot | Accuracy |22333 |ZH |200 |
-| PassageRetrieval-en| Synthetic | Accuracy |9288 |EN |200 |
-| PassageCount| Synthetic | Accuracy |11141 |EN |200 |
-| PassageRetrieval-zh | Synthetic | Accuracy |6745 |ZH |200 |
-| LCC| Code | Edit Sim |1235 |Python/C#/Java |500 |
-| RepoBench-P| Code | Edit Sim |5622 |Python/Java |500 |
-
-> Note: In order to avoid discrepancies caused by different tokenizers, we use the word count (using Python's split function) to calculate the average length of English datasets and code datasets, and use the character count to calculate the average length of Chinese datasets.
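To make this counting rule concrete, here is a minimal sketch; `text_length` is a hypothetical helper (not part of the repository) and it assumes the `language` field uses a "zh" code for Chinese samples:

```python
def text_length(text: str, language: str) -> int:
    """Word count (whitespace split) for English/code text, character count for Chinese."""
    return len(text) if language == "zh" else len(text.split())

# Each LongBench sample also stores its length in the "length" field, so the
# average length of a loaded subset can be computed directly:
# avg_length = sum(sample["length"] for sample in data) / len(data)
```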
- -# Task description - -| Task | Task Description | -| ----------------- | ------------------------------------------------------------ | -| HotpotQA | Answer related questions based on multiple given documents | -| 2WikiMultihopQA | Answer related questions based on multiple given documents | -| Musique | Answer related questions based on multiple given documents | -| DuReader | Answer related Chinese questions based on multiple retrieved documents | -| MultiFieldQA-en | Answer English questions based on a single document, which comes from a relatively diverse field | -| MultiFieldQA-zh | Answer Chinese questions based on a single document, which comes from a relatively diverse field | -| NarrativeQA | Ask questions based on stories or scripts, including understanding of important elements such as characters, plots, themes, etc. | -| Qasper | Ask questions based on a single paper, questions proposed by NLP readers, and answered by NLP practitioners | -| GovReport | A summarization task that requires summarizing government work reports | -| QMSum | A summarization task that requires summarizing meeting records based on user queries | -| VCSUM | A summarization task that requires summarizing Chinese meeting records | -| TriviaQA | Single document question answering task, providing several Few Shot examples | -| NQ | Single document question answering task, providing several Few Shot examples | -| TREC | A classification task that requires categorizing questions, includes 50 categories in total | -| LSHT | A Chinese classification task that requires categorizing news, includes 24 categories in total | -| PassageRetrieval-en | Given 30 English Wikipedia paragraphs, determine which paragraph the given summary belongs to | -| PassageCount | Determine the number of non-repeating paragraphs in a given number of paragraphs | -| PassageRetrieval-zh | Given several Chinese paragraphs from the C4 data set, determine which paragraph the given abstract belongs to | -| LCC | Given a longer piece of code, predict the next line of code | -| RepoBench-P | Given code in multiple files within a GitHub repository (including inter-file dependencies), predict the next line of code | - - -# Task construction - -> Note: For all tasks constructed from existing datasets, we use data from the validation or test set of the existing dataset (except for VCSUM). - -- The tasks of [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) are built based on the original datasets and processed to make them suitable for long text evaluation. Specifically, for questions in the validation set, we select the evidence passage that contains the answer and several distracting articles. These articles together with the original question constitute the input of the related tasks. -- The tasks of MultiFiedQA-zh and MultiFieldQA-en consist of long-text data from about 10 sources, including Latex papers, judicial documents, government work reports, and PDF documents indexed by Google. For each long text, we invite several PhD and master students to annotate, i.e., to ask questions based on the long text and give the correct answers. To better automate evaluation, we require the annotators to ask questions with definitive answers as much as possible. 
-- The tasks of [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), and [QMSum](https://arxiv.org/pdf/2104.05938.pdf) directly use the data provided by the original papers. During construction, we use the templates provided by [ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/) to convert the corresponding data into plain-text inputs.
-- The [VCSUM](https://arxiv.org/abs/2305.05280) task is built on the original dataset, and we designed a corresponding template to convert the data into plain-text inputs.
-- The tasks of [TriviaQA](https://nlp.cs.washington.edu/triviaqa/) and [NQ](https://ai.google.com/research/NaturalQuestions/) are constructed in the manner of [CoLT5](https://arxiv.org/abs/2303.09752): several examples of document-based question answering are provided, and the language model is required to answer related questions based on a new document.
-- The tasks of [TREC](https://aclanthology.org/C02-1150.pdf) and [LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf) are built on the original datasets. For each question in the validation set, we sample several examples from the training set to form few-shot demonstrations; these demonstrations, together with the validation question, constitute the input for the task (see the few-shot sketch after this list).
-- The PassageRetrieval-en task is constructed from English Wikipedia. For each piece of data, we randomly sample 30 paragraphs from English Wikipedia and select one for summarization (using GPT-3.5-Turbo). The task requires the model to identify which original paragraph the summary corresponds to.
-- The PassageCount task is constructed from English Wikipedia. For each piece of data, we randomly sample several passages from English Wikipedia, repeat each paragraph a random number of times, and finally shuffle the paragraphs. The task requires the model to determine how many unique (non-repeated) paragraphs there are among the given paragraphs (see the construction sketch after this list).
-- The PassageRetrieval-zh task is constructed from [C4](https://arxiv.org/abs/1910.10683). For each piece of data, we randomly sample several Chinese paragraphs from C4 and select one of them for summarization (using GPT-3.5-Turbo). The task requires the model to identify which original paragraph the summary corresponds to.
-- For the [LCC](https://arxiv.org/abs/2306.14893) task, we sample from the original code completion dataset. For the [RepoBench-P](https://arxiv.org/abs/2306.03091) task, we select the most challenging XF-F (Cross-File-First) setting from the original dataset and follow its Oracle-Filled scenario. For each original piece of data, we randomly extract multiple cross-file code snippets, including the effective cross-file code snippet (gold snippet), and concatenate them as the input, requiring the model to use the effective cross-file code to complete the code in the current file (see the sketch after this list).
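To make the PassageCount construction above concrete, here is a rough sketch. The paragraph pool, sampling ranges, prompt formatting, and function names are assumptions for illustration rather than the benchmark's actual construction code.

```python
import random


def build_passagecount_example(paragraph_pool, n_unique=(5, 10), max_repeats=4, seed=None):
    """Sketch: sample paragraphs, duplicate each a random number of times,
    shuffle, and keep the number of unique paragraphs as the answer."""
    rng = random.Random(seed)
    k = rng.randint(*n_unique)
    unique_paragraphs = rng.sample(paragraph_pool, k)
    repeated = []
    for p in unique_paragraphs:
        repeated.extend([p] * rng.randint(1, max_repeats))
    rng.shuffle(repeated)
    context = "\n\n".join(f"Paragraph {i + 1}: {p}" for i, p in enumerate(repeated))
    return {"context": context, "answer": k}  # the model must recover k


# Toy usage with placeholder "paragraphs"
pool = [f"This is placeholder Wikipedia paragraph number {i}." for i in range(50)]
example = build_passagecount_example(pool, seed=0)
print(example["answer"])
```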
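The TREC/LSHT few-shot construction described above can likewise be sketched as follows; the record fields (`question`, `label`) and the prompt wording are illustrative assumptions only.

```python
import random


def build_fewshot_input(train_set, eval_question, n_shots=8, seed=0):
    """Sketch: a handful of labelled training examples followed by the
    unlabelled validation question, concatenated into one prompt."""
    rng = random.Random(seed)
    shots = rng.sample(train_set, min(n_shots, len(train_set)))
    lines = [f"Question: {ex['question']}\nType: {ex['label']}" for ex in shots]
    lines.append(f"Question: {eval_question}\nType:")
    return "\n\n".join(lines)


train_set = [
    {"question": "What is the capital of France?", "label": "LOC"},
    {"question": "Who wrote Hamlet?", "label": "HUM"},
    {"question": "When did World War II end?", "label": "NUM"},
]
print(build_fewshot_input(train_set, "Where is the Great Barrier Reef?", n_shots=2))
```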
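Finally, a rough sketch of assembling an Oracle-Filled, RepoBench-P-style input as described above: the gold cross-file snippet is mixed in among distractor snippets and concatenated before the in-file context. The snippet fields and comment formatting are assumptions for illustration, not the dataset's actual layout.

```python
import random


def build_repobench_input(gold_snippet, distractors, in_file_context, n_distractors=3, seed=0):
    """Sketch: shuffle the gold cross-file snippet in among distractor snippets,
    then append the current file's code; the model predicts its next line."""
    rng = random.Random(seed)
    snippets = rng.sample(distractors, min(n_distractors, len(distractors)))
    snippets.append(gold_snippet)
    rng.shuffle(snippets)
    cross_file = "\n\n".join(f"# Snippet from {s['path']}\n{s['code']}" for s in snippets)
    return cross_file + "\n\n# Current file\n" + in_file_context


gold = {"path": "utils/math_ops.py", "code": "def add(a, b):\n    return a + b"}
distractors = [{"path": f"pkg/module_{i}.py", "code": f"VALUE_{i} = {i}"} for i in range(5)]
print(build_repobench_input(gold, distractors, "from utils.math_ops import add\nresult ="))
```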
\ No newline at end of file
diff --git a/task_zh.md b/task_zh.md
new file mode 100644
index 0000000..67a6a0d
--- /dev/null
+++ b/task_zh.md
@@ -0,0 +1,66 @@
+# Task statistics
+
+| Task | Task Type | Eval metric | Avg len | Language | \#Sample |
+| :--------- | :-----------: | :-----------: | :---------: | :-------------: | :---------: |
+| HotpotQA | Multi-doc QA | F1 | 9,149 | EN | 200 |
+| 2WikiMultihopQA | Multi-doc QA | F1 | 4,885 | EN | 200 |
+| Musique | Multi-doc QA | F1 | 7,798 | EN | 200 |
+| DuReader | Multi-doc QA | Rouge-L | 15,768 | ZH | 200 |
+| MultiFieldQA-en | Single-doc QA | F1 | 4,559 | EN | 150 |
+| MultiFieldQA-zh | Single-doc QA | F1 | 6,771 | ZH | 200 |
+| NarrativeQA | Single-doc QA | F1 | 18,405 | EN | 200 |
+| Qasper | Single-doc QA | F1 | 3,619 | EN | 200 |
+| GovReport | Summarization | Rouge-L | 8,169 | EN | 200 |
+| QMSum | Summarization | Rouge-L | 10,546 | EN | 200 |
+| VCSUM | Summarization | Rouge-L | 15,147 | ZH | 200 |
+| TriviaQA | Few-shot | F1 | 8,015 | EN | 200 |
+| NQ | Few-shot | F1 | 8,210 | EN | 200 |
+| TREC | Few-shot | Accuracy | 5,176 | EN | 200 |
+| LSHT | Few-shot | Accuracy | 22,333 | ZH | 200 |
+| PassageRetrieval-en | Synthetic | Accuracy | 9,288 | EN | 200 |
+| PassageCount | Synthetic | Accuracy | 11,141 | EN | 200 |
+| PassageRetrieval-zh | Synthetic | Accuracy | 6,745 | ZH | 200 |
+| LCC | Code | Edit Sim | 1,235 | Python/C#/Java | 500 |
+| RepoBench-P | Code | Edit Sim | 5,622 | Python/Java | 500 |
+
+> Note: To avoid discrepancies caused by different tokenizers, we use the word count (Python's split function) to calculate the average length of the English and code datasets, and the character count to calculate the average length of the Chinese datasets.
+
+# Task description
+
+| Task | Task Description |
+| :----------------- | :----------------------------------------------------------- |
+| HotpotQA | Answer related questions based on multiple given documents |
+| 2WikiMultihopQA | Answer related questions based on multiple given documents |
+| Musique | Answer related questions based on multiple given documents |
+| DuReader | Answer related Chinese questions based on multiple retrieved documents |
+| MultiFieldQA-en | Answer English questions based on a single document drawn from a relatively diverse set of fields |
+| MultiFieldQA-zh | Answer Chinese questions based on a single document drawn from a relatively diverse set of fields |
+| NarrativeQA | Answer questions about stories or scripts, covering important elements such as characters, plots, and themes |
+| Qasper | Answer questions over a single paper; questions are posed by NLP readers and answered by NLP practitioners |
+| GovReport | A summarization task that requires summarizing government work reports |
+| QMSum | A summarization task that requires summarizing meeting records based on a user query |
+| VCSUM | A summarization task that requires summarizing Chinese meeting records |
+| TriviaQA | Single-document question answering task, with several few-shot examples provided |
+| NQ | Single-document question answering task, with several few-shot examples provided |
+| TREC | A classification task that requires categorizing questions into 50 categories in total |
+| LSHT | A Chinese classification task that requires categorizing news into 24 categories in total |
+| PassageRetrieval-en | Given 30 English Wikipedia paragraphs, determine which paragraph the given summary belongs to |
+| PassageCount | Determine the number of non-repeated paragraphs among a given set of paragraphs |
+| PassageRetrieval-zh | Given several Chinese paragraphs from the C4 dataset, determine which paragraph the given summary belongs to |
+| LCC | Given a long piece of code, predict the next line of code |
+| RepoBench-P | Given code from multiple files within a GitHub repository (including inter-file dependencies), predict the next line of code |
+
+# Task construction
+
+> Note: For all tasks constructed from existing datasets, we use data from the validation or test set of the existing dataset (except for VCSUM).
+
+- The tasks of [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) are built on the original datasets and processed to make them suitable for long-text evaluation. Specifically, for questions in the validation set, we select the evidence passage that contains the answer and several distracting articles; these articles, together with the original question, constitute the input of the related tasks.
+- The tasks of MultiFieldQA-zh and MultiFieldQA-en consist of long-text data from about 10 sources, including LaTeX papers, judicial documents, government work reports, and PDF documents indexed by Google. For each long text, we invite several PhD and master's students to annotate it, i.e., to ask questions based on the long text and give the correct answers. To better automate evaluation, we require the annotators to ask questions with definitive answers as much as possible.
+- The tasks of [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), and [QMSum](https://arxiv.org/pdf/2104.05938.pdf) directly use the data provided by the original papers. During construction, we use the templates provided by [ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/) to convert the corresponding data into plain-text inputs.
+- The [VCSUM](https://arxiv.org/abs/2305.05280) task is built on the original dataset, and we designed a corresponding template to convert the data into plain-text inputs.
+- The tasks of [TriviaQA](https://nlp.cs.washington.edu/triviaqa/) and [NQ](https://ai.google.com/research/NaturalQuestions/) are constructed in the manner of [CoLT5](https://arxiv.org/abs/2303.09752): several examples of document-based question answering are provided, and the language model is required to answer related questions based on a new document.
+- The tasks of [TREC](https://aclanthology.org/C02-1150.pdf) and [LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf) are built on the original datasets. For each question in the validation set, we sample several examples from the training set to form few-shot demonstrations; these demonstrations, together with the validation question, constitute the input for the task.
+- The PassageRetrieval-en task is constructed from English Wikipedia. For each piece of data, we randomly sample 30 paragraphs from English Wikipedia and select one for summarization (using GPT-3.5-Turbo). The task requires the model to identify which original paragraph the summary corresponds to.
+- The PassageCount task is constructed from English Wikipedia. For each piece of data, we randomly sample several passages from English Wikipedia, repeat each paragraph a random number of times, and finally shuffle the paragraphs. The task requires the model to determine how many unique (non-repeated) paragraphs there are among the given paragraphs.
+- The PassageRetrieval-zh task is constructed from [C4](https://arxiv.org/abs/1910.10683). For each piece of data, we randomly sample several Chinese paragraphs from C4 and select one of them for summarization (using GPT-3.5-Turbo). The task requires the model to identify which original paragraph the summary corresponds to.
+- For the [LCC](https://arxiv.org/abs/2306.14893) task, we sample from the original code completion dataset. For the [RepoBench-P](https://arxiv.org/abs/2306.03091) task, we select the most challenging XF-F (Cross-File-First) setting from the original dataset and follow its Oracle-Filled scenario. For each original piece of data, we randomly extract multiple cross-file code snippets, including the effective cross-file code snippet (gold snippet), and concatenate them as the input, requiring the model to use the effective cross-file code to complete the code in the current file.
\ No newline at end of file
--
GitLab