From 62fa455dd58c3084c346324bc35f70a057fdacfc Mon Sep 17 00:00:00 2001
From: Lv Xin <lvxin14@163.com>
Date: Sun, 30 Jul 2023 11:18:27 +0800
Subject: [PATCH] Change the default language of README to English

---
 README.md    | 292 +++++++++++++++++++++++++++------------------------
 README_EN.md | 187 ---------------------------------
 README_ZH.md | 175 ++++++++++++++++++++++++++++++
 task.md      | 119 ++++++++++-----------
 task_en.md   |  67 ------------
 task_zh.md   |  66 ++++++++++++
 6 files changed, 453 insertions(+), 453 deletions(-)
 delete mode 100644 README_EN.md
 create mode 100644 README_ZH.md
 delete mode 100644 task_en.md
 create mode 100644 task_zh.md

diff --git a/README.md b/README.md
index d97741a..95519ba 100644
--- a/README.md
+++ b/README.md
@@ -3,62 +3,65 @@
     🤗 <a href="https://huggingface.co/THUDM/chatglm-6b" target="_blank">HF Repo</a> • 📃 Paper coming soon!
 </p>
 
-Read this in [English](README_EN.md).
-
-# LongBench: 多语言、多任务长文本理解评测基准
-
-**LongBench**是第一个多语言、多任务、全面评估大语言模型**长文本理解**能力的综合数据集。在目前大模型多语言能力引起广泛关注的背景下,LongBench涵盖了不同的语言(中文和英文),以此来对大模型在长文本下的多语言能力进行更全面的评估。同时,LongBench由二十个不同的任务组成,覆盖了单文档QA、多文档QA、摘要、Few-shot学习、代码补全和合成任务等关键的长文本应用场景。
-
-我们深知模型评测过程中可能产生的高昂成本,尤其是长文本场景下(如人工标注成本或API调用成本)。因此,我们采用了一种全自动的评测方式,旨在以最低的成本,最有效地衡量和评估模型的长文本理解能力。
-
-LongBench包含13个英文任务、5个中文任务和2个代码任务,多数任务的平均长度在5k-15k之间,共包含约4500条测试数据。从主要任务分类上,LongBench包含六大类任务,即单文档QA、多文档QA、摘要、Few-shot学习、代码补全和合成任务。关于LongBench数据集的具体统计及任务构造方式请参考[这里](task.md)。
-
-|   任务类型   | 英文任务数 | 中文任务数 | 代码任务数 |
-| :----------: | :--------: | :--------: | :--------: |
-|   单文档QA   |     3      |     1      |     -      |
-|   多文档QA   |     3      |     1      |     -      |
-|     摘要     |     2      |     1      |     -      |
-| Few-shot学习 |     3      |     1      |     -      |
-|   合成任务   |     2      |     1      |     -      |
-|   代码补全   |     -      |     -      |     2      |
-
-## 排行榜
-我们在这里展示了所有模型在Zero-shot场景下,在中文和英文各大类任务上得分的平均值(%),各任务评估所用指标请参考[这里](task.md)。
-#### 英文榜单
-|     | Avg | 单文档QA | 多文档QA | 摘要 | Few-shot学习 | 代码补全 | 合成任务 |
-| --- | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
-| GPT-3.5-Turbo-16k | 45.5 | 39.8 | 38.7 | 26.5 | 76.0 | 54.5 | 37.8 |
-| Llama2-7B-chat-4k | 29.0 | 24.8 | 21.4 | 23.9 | 50.5 | 47.3 | 5.9 |
-| LongChat-7B-16k | 33.7 | 29.3 | 16.1 | 25.8 | 59.9 | 57.0 | 14.2 |
-| XGen-7B-8k | 28.7 | 24.5 | 20.4 | 24.8 | 58.7 | 38.0 | 5.6 |
-| InternLM-7B-8k | 24.7 | 17.1 | 20.8 | 13.3 | 52.7 | 39.7 | 4.7 |
-| ChatGLM2-6B | 26.0 | 23.1 | 15.0 | 22.9 | 46.1 | 46.1 | 2.7 |
-| ChatGLM2-6B-32k | 42.7 | 32.8 | 34.0 | 28.6 | 68.1 | 52.7 | 39.8 |
-
-#### 中文榜单
-|       | Avg | 单文档QA | 多文档QA | 摘要 | Few-shot学习 | 代码补全 | 合成任务 |
-|-------|:---:|:-------------:|:------------:|:-------------:|:-----------------:|:---------------:|:----------------:|
-| GPT-3.5-Turbo-16k | 44.5 | 61.2 | 28.7 | 16.0 | 29.2 | 54.5 | 77.5 |
-| Llama2-7B-chat-4k | 13.5 | 11.6 | 1.9 | 0.2 | 19.8 | 47.3 | 0.5 |
-| LongChat-7B-16k | 23.7 | 26.6 | 19.1 | 14.0 | 20.8 | 57.0 | 4.8 |
-| XGen-7B-8k | 14.5 | 14.2 | 9.1 | 1.5 | 20.0 | 38.0 | 4.2 |
-| InternLM-7B-8k | 18.6 | 33.3 | 8.9 | 13.0 | 15.5 | 39.7 | 0.9 |
-| ChatGLM2-6B | 22.5 | 33.0 | 15.2 | 14.6 | 20.5 | 46.1 | 5.5 |
-| ChatGLM2-6B-32k | 41.3 | 52.0 | 34.3 | 16.3 | 29.9 | 52.7 | 62.5 |
-
-#### 长文本任务能力雷达图
+阅读 [中文版本](README_ZH.md)
+
+# LongBench: A Multilingual, Multitask Benchmark Tailored for Long Context Understanding
+
+**LongBench** is the first benchmark for multilingual, multitask, and comprehensive assessment of the **long context understanding** capabilities of large language models. Against the backdrop of growing attention to the multilingual capabilities of large models, LongBench covers different languages (Chinese and English) to enable a more thorough evaluation of models' multilingual abilities on long texts. In addition, LongBench consists of twenty different tasks, covering key long-text application scenarios such as single-document QA, multi-document QA, summarization, few-shot learning, code completion, and synthetic tasks.
+
+We are fully aware of the potentially high costs of model evaluation, especially in long-text scenarios (such as manual annotation or API call costs). We therefore adopt a fully automated evaluation method designed to measure models' long context understanding as effectively and cheaply as possible.
+
+LongBench includes 13 English tasks, 5 Chinese tasks, and 2 code tasks, comprising about 4,500 test instances in total, with the average length of most tasks ranging from 5k to 15k. In terms of major categories, LongBench covers six types of tasks: single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion. For detailed statistics and construction methods of the LongBench tasks, please refer [here](task.md).
+
+| Task Type | \#English Task | \#Chinese Task | \#Code Task |
+| :-------: | :--------------------: | :--------------------: | :------------------: |
+| Multi-document QA | 3 | 1 | - |
+| Single-document QA | 3 | 1 | - |
+| Summarization | 2 | 1 | - |
+| Few-shot learning | 3 | 1 | - |
+| Synthetic Tasks | 2 | 1 | - |
+| Code Completion | - | - | 2 |
+
+## Leaderboard
+Here are the average scores (%) of all models over the major task categories in English and Chinese under the zero-shot setting. Please refer to this [link](task.md) for the evaluation metric used for each task.
+
+#### English
+|                   | Avg  | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Tasks |
+| ----------------- | :--: | :-----------: | :----------: | :-----------: | :---------------: | :-------------: | :-------------: |
+| GPT-3.5-Turbo-16k | 45.5 |     39.8      |     38.7     |     26.5      |       76.0        |      54.5       |      37.8       |
+| Llama2-7B-chat-4k | 29.0 |     24.8      |     21.4     |     23.9      |       50.5        |      47.3       |       5.9       |
+| LongChat-7B-16k   | 33.7 |     29.3      |     16.1     |     25.8      |       59.9        |      57.0       |      14.2       |
+| XGen-7B-8k        | 28.7 |     24.5      |     20.4     |     24.8      |       58.7        |      38.0       |       5.6       |
+| InternLM-7B-8k    | 24.7 |     17.1      |     20.8     |     13.3      |       52.7        |      39.7       |       4.7       |
+| ChatGLM2-6B       | 26.0 |     23.1      |     15.0     |     22.9      |       46.1        |      46.1       |       2.7       |
+| ChatGLM2-6B-32k   | 42.7 |     32.8      |     34.0     |     28.6      |       68.1        |      52.7       |      39.8       |
+
+#### Chinese
+
+|                   | Avg  | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Tasks |
+| ----------------- | :--: | :-----------: | :----------: | :-----------: | :---------------: | :-------------: | :-------------: |
+| GPT-3.5-Turbo-16k | 44.5 |     61.2      |     28.7     |     16.0      |       29.2        |      54.5       |      77.5       |
+| Llama2-7B-chat-4k | 13.5 |     11.6      |     1.9      |      0.2      |       19.8        |      47.3       |       0.5       |
+| LongChat-7B-16k   | 23.7 |     26.6      |     19.1     |     14.0      |       20.8        |      57.0       |       4.8       |
+| XGen-7B-8k        | 14.5 |     14.2      |     9.1      |      1.5      |       20.0        |      38.0       |       4.2       |
+| InternLM-7B-8k    | 18.6 |     33.3      |     8.9      |     13.0      |       15.5        |      39.7       |       0.9       |
+| ChatGLM2-6B       | 22.5 |     33.0      |     15.2     |     14.6      |       20.5        |      46.1       |       5.5       |
+| ChatGLM2-6B-32k   | 41.3 |     52.0      |     34.3     |     16.3      |       29.9        |      52.7       |      62.5       |
+
+#### Radar Chart of Long Text Task Capability 
+
 ![](misc/radar.png)
 
-#### 不同长度文本下的能力变化
-为了更有针对性地分析模型在不同文本长度下的相对表现,下图展示了模型在不同文本长度区间上,所有任务上的平均相对分数。
+#### Variation of Abilities under Different Text Lengths
+To analyze the models' relative performance at different text lengths more specifically, the following chart shows the models' average relative scores across all tasks within different length intervals.
 ![](misc/curve.png)
 
-> 注:假设模型在某个任务的特定长度范围内数据上得分为x,在该任务所有数据上得分为y,则模型在该长度范围的**相对分数**为(x/y-1)。为了更好比较不同模型的变化趋势,我们在0-4k将所有折线平移至0。
+> Note: Suppose a model scores x on the data of a task within a specific length range and y on all data of that task; the model's **relative score** for that length range is then (x/y - 1). To better compare the trends of different models, we shift all curves to 0 at the 0-4k interval.
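+
+To make the formula concrete, here is a minimal sketch (with made-up numbers, not actual benchmark results) of how a relative-score curve could be computed and shifted to 0 at the 0-4k interval:
+```python
+# Hypothetical per-interval scores (x) and overall score (y) for one model on one task.
+interval_scores = {"0-4k": 45.0, "4-8k": 42.0, "8k+": 36.0}  # x values (made up)
+overall_score = 41.0                                         # y value (made up)
+
+# Relative score for each length interval: x / y - 1
+relative = {k: x / overall_score - 1 for k, x in interval_scores.items()}
+
+# Shift the curve so that it starts at 0 in the 0-4k interval.
+offset = relative["0-4k"]
+shifted = {k: v - offset for k, v in relative.items()}
+print(shifted)  # {'0-4k': 0.0, '4-8k': -0.073..., '8k+': -0.219...}
+```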
 
-## 如何在LongBench上评测模型
+## How to evaluate models on LongBench
 
-#### 载入数据
-你可以通过Hugging Face datasets来下载并载入**LongBench**的数据([🤗 HF Repo](https://huggingface.co/datasets/THUDM/LongBench)):
+#### Loading Data
+You can download and load the **LongBench** data with the Hugging Face datasets library ([🤗 HF Repo](https://huggingface.co/datasets/THUDM/LongBench)):
 ```python
 from datasets import load_dataset
 
@@ -69,107 +72,116 @@ datasets = ["hotpotqa", "2wikimqa", "musique", "dureader", "narrativeqa", "qaspe
 for dataset in datasets:
     data = load_dataset('THUDM/LongBench', dataset, split='test')
 ```
-同样地,你也可以直接用这个[链接](https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip)下载所有的评测数据。
+Alternatively, you can download all the evaluation data directly from [this link](https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip).
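+
+If you prefer working from the zip archive, the following is a minimal sketch of downloading and reading it; the `data/hotpotqa.jsonl` path is an assumption about the archive layout, so adjust it to whatever the archive actually contains:
+```python
+import json
+import urllib.request
+import zipfile
+
+# Download and unpack the evaluation data (same URL as above).
+url = "https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip"
+urllib.request.urlretrieve(url, "data.zip")
+with zipfile.ZipFile("data.zip") as zf:
+    zf.extractall(".")
+
+# Assumed layout: one JSON-lines file per dataset.
+with open("data/hotpotqa.jsonl", encoding="utf-8") as f:
+    samples = [json.loads(line) for line in f]
+print(len(samples), samples[0]["input"][:80])
+```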
+
+#### Data Format
+
+All data in **LongBench** are standardized to the following format:
 
-#### 数据格式
-**LongBench**中所有数据都统一为以下格式:
 ```json
 {
-    "input": "任务的输入/指令,通常较短,比如QA中的问题、Few-shot任务中的提问等",
-    "context": "任务所需的长语境文本,比如文档、跨文件代码、Few-shot任务中的few-shot样本",
-    "answers": "由所有标准答案组成的列表",
-    "length": "前三项文本的总长度(中、英文分别用字、词数统计)",
-    "dataset": "本条数据所属数据集名称",
-    "language": "本条数据的语言",
-    "all_classes": "分类任务中的所有类别,非分类任务则为null",
-    "_id": "每条数据的随机id"
+    "input": "The input/command for the task, usually short, such as questions in QA, queries in Few-shot tasks, etc.",
+    "context": "The long context text required for the task, such as documents, cross-file code, few-shot samples in Few-shot tasks",
+    "answers": "List composed of all standard answers",
+    "length": "Total length of the first three items of text (counted in characters for Chinese and words for English)",
+    "dataset": "The name of the dataset to which this piece of data belongs",
+    "language": "The language of this piece of data",
+    "all_classes": "All categories in classification tasks, null for non-classification tasks",
+    "_id": "Random id for each piece of data"
 }
 ```
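+
+For instance, a small sketch of how these fields might be used to slice one dataset by length (field names as documented above):
+```python
+from datasets import load_dataset
+
+data = load_dataset('THUDM/LongBench', 'multifieldqa_zh', split='test')
+
+# Bucket samples by the provided "length" field.
+buckets = {"0-4k": 0, "4-8k": 0, "8k+": 0}
+for sample in data:
+    n = sample["length"]
+    if n < 4000:
+        buckets["0-4k"] += 1
+    elif n < 8000:
+        buckets["4-8k"] += 1
+    else:
+        buckets["8k+"] += 1
+print(data[0]["language"], buckets)
+```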
 
-#### 评测
-我们以ChatGLM2-6B为例提供了一份评测代码。首先,运行仓库下的[pred.py](pred.py)
+#### Evaluation
+We provide evaluation code using ChatGLM2-6B as an example. First, run [pred.py](pred.py) in the repository:
 ```bash
 CUDA_VISIBLE_DEVICES=0 python pred.py
 ```
-可以在`pred/`文件夹下得到模型在所有数据集下的输出,此后运行[eval.py](eval.py)的评测代码:
+This produces the model outputs on all datasets in the `pred/` folder. Then run the evaluation code in [eval.py](eval.py):
 ```bash
 python eval.py
 ```
-可以在`result.json`中得到在各数据集上的评测结果。请注意,我们在`config/`下提供了我们总结出来的在各数据集上适合的输入格式和最大输出长度限制,在评测的时候可以进行修改以更好地适用你要评测的模型,修改后在[pred.py](pred.py)评测时会自动按照新的格式去整理数据并得到对应的模型输出。
+The evaluation results on all datasets are written to `result.json`. Please note that `config/` provides the input format and the maximum output length limit we recommend for each dataset; you can modify them to better suit the model you want to evaluate, and [pred.py](pred.py) will then automatically organize the data according to the new format and produce the corresponding model outputs.
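+
+As an illustration of how these per-dataset settings might be consumed when building prompts (the file names `dataset2prompt.json` and `dataset2maxlen.json` are assumptions here; check the actual files under `config/`), a prediction script could do something like:
+```python
+import json
+
+# Assumed config file names and structure: dataset name -> prompt template / max output length.
+with open("config/dataset2prompt.json", encoding="utf-8") as f:
+    dataset2prompt = json.load(f)
+with open("config/dataset2maxlen.json", encoding="utf-8") as f:
+    dataset2maxlen = json.load(f)
+
+dataset = "hotpotqa"
+sample = {"input": "Which magazine was started first, Arthur's Magazine or First for Women?",
+          "context": "...long concatenated passages..."}
+
+# Assuming the template uses {context} and {input} placeholders.
+prompt = dataset2prompt[dataset].format(**sample)
+max_gen = dataset2maxlen[dataset]
+print(prompt[:200], max_gen)
+```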
+
+## Evaluation Result on Each Dataset
+
+The following tables show the zero-shot evaluation results (%) of the models on all subtask datasets, where Chinese datasets are denoted by "zh" (please refer to this [link](task.md) for the evaluation metric used for each task).
+
+> Note: For texts that exceed a model's processing length, following the observation in [Lost in the Middle](https://arxiv.org/abs/2307.03172), we truncate from the middle of the text and keep the information at the head and the tail. Experiments show that this truncation has the smallest impact on model performance.
 
-## 详细评测结果
-下面的几张表格展示了模型在所有子任务数据集上的Zero-shot评测结果(%),其中的中文数据集以“zh”标示(各任务评估所用指标请参考[这里](task.md))。
-> 注:对于超出模型处理长度能力的文本,参考[Lost in the Middle](https://arxiv.org/abs/2307.03172)的观察,我们从文本中间进行截断,保持前后部分的信息。实验表明,这种截断方式对模型性能影响最小。
-#### 单文档QA
+#### Single-Document QA
 |                   | NarrativeQA | Qasper | MultiFieldQA-en | MultiFieldQA-zh |
-|-------------------|:-----------:|:------:|:---------------:|:---------------:|
-| GPT-3.5-Turbo-16k | 23.6 | 43.3 | 52.3 | 61.2 |
-| Llama2-7B-chat-4k | 19.1 | 19.6 | 35.8 | 11.6 |
-| LongChat-7B-16k | 21.6 | 21.6 | 44.6 | 26.6 |
-| XGen-7B-8k | 17.9 | 18.3 | 37.2 | 14.2 |
-| InternLM-7B-8k | 12.4 | 16.8 | 22.3 | 33.3 |
-| ChatGLM2-6B | 11.2 | 23.7 | 34.2 | 33.0 |
-| ChatGLM2-6B-32k | 20.4 | 32.2 | 45.7 | 52.0 |
-
-#### 多文档QA
-|                      | HotpotQA | 2WikiMQA | Musique | DuReader (zh) |
-|----------------------|:--------:|:--------:|:-------:|:--------:|
-| GPT-3.5-Turbo-16k | 51.6 | 37.7 | 26.9 | 28.7 |
-| Llama2-7B-chat-4k | 24.3 | 31.4 | 8.6 | 1.9 |
-| LongChat-7B-16k | 22.4 | 16.8 | 9.1 | 19.1 |
-| XGen-7B-8k | 28.3 | 21.5 | 11.5 | 9.1 |
-| InternLM-7B-8k | 27.9 | 24.0 | 10.3 | 8.9 |
-| ChatGLM2-6B | 20.2 | 19.6 | 5.3 | 15.2 |
-| ChatGLM2-6B-32k | 44.9 | 34.9 | 22.2 | 34.3 |
-
-#### 摘要
-|            | GovReport | QMSum | VCSUM (zh) |
-|:-----------|:---------:|:-----:|:-----:|
-| GPT-3.5-Turbo-16k | 29.5 | 23.4 | 16.0 |
-| Llama2-7B-chat-4k | 27.3 | 20.6 | 0.2 |
-| LongChat-7B-16k | 28.4 | 23.2 | 14.0 |
-| XGen-7B-8k | 27.8 | 21.7 | 1.5 |
-| InternLM-7B-8k | 9.8 | 16.8 | 13.0 |
-| ChatGLM2-6B | 23.7 | 22.2 | 14.6 |
-| ChatGLM2-6B-32k | 33.3 | 23.9 | 16.3 |
-
-#### Few-shot学习
-|     | TREC | NQ | TriviaQA | LSHT (zh) |
-| --- | :-: | :-: | :-: | :-: |
-| GPT-3.5-Turbo-16k | 68.0 | 73.0 | 87.1 | 29.2 |
-| Llama2-7B-chat-4k | 60.5 | 31.4 | 59.7 | 19.8 |
-| LongChat-7B-16k | 61.5 | 44.8 | 73.5 | 20.8 |
-| XGen-7B-8k | 66.0 | 43.2 | 67.0 | 20.0 |
-| InternLM-7B-8k | 49.0 | 47.6 | 61.6 | 15.5 |
-| ChatGLM2-6B | 44.0 | 34.5 | 59.8 | 20.5 |
-| ChatGLM2-6B-32k | 62.0 | 64.9 | 77.6 | 29.9 |
-
-#### 代码补全
-|     | LCC | RepoBench-P |
-| --- | :-: | :-: |
-| GPT-3.5-Turbo-16k | 54.7 | 54.3 |
-| Llama2-7B-chat-4k | 52.3 | 42.4 |
-| LongChat-7B-16k | 59.2 | 54.7 |
-| XGen-7B-8k | 38.8 | 37.3 |
-| InternLM-7B-8k | 45.5 | 34.0 |
-| ChatGLM2-6B | 48.4 | 43.7 |
-| ChatGLM2-6B-32k | 55.4 | 50.0 |
-
-#### 合成任务
-|     | PassageRetrieval-en | Passage Count | PassageRetrieval-zh |
-| --- | :-: | :-: | :-: |
-| GPT-3.5-Turbo-16k | 71.0 | 4.5 | 77.5 |
-| Llama2-7B-chat-4k | 9.2 | 2.5 | 0.5 |
-| LongChat-7B-16k | 24.0 | 4.5 | 4.8 |
-| XGen-7B-8k | 9.0 | 2.2 | 4.2 |
-| InternLM-7B-8k | 6.5 | 2.9 | 0.9 |
-| ChatGLM2-6B | 3.2 | 2.1 | 5.5 |
-| ChatGLM2-6B-32k | 77.5 | 2.0 | 62.5 |
-
-## 致谢
-- **LongBench**的部分任务基于之前的研究者提出的数据集构建,包括[HotpotQA](https://hotpotqa.github.io/),[2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/),[Musique](https://arxiv.org/abs/2108.00573),[DuReader](https://github.com/baidu/DuReader),[NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf),[Qasper](https://arxiv.org/pdf/2105.03011.pdf),[GovReport](https://arxiv.org/pdf/2104.02112.pdf),[QMSum](https://arxiv.org/pdf/2104.05938.pdf),[VCSUM](https://arxiv.org/abs/2305.05280),[TriviaQA](https://nlp.cs.washington.edu/triviaqa/),[NQ](https://ai.google.com/research/NaturalQuestions/),[TREC](https://aclanthology.org/C02-1150.pdf),[LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf),[LCC](https://arxiv.org/abs/2306.14893)和[RepoBench-P](https://arxiv.org/abs/2306.03091)。
-
-## 引用
-本工作由**THUKEG**和**Zhipu AI**共同完成,相关论文正在撰写中,届时将更新引用信息,敬请关注~
-
-如果您使用该Benchmark,可以一并引用LongBench所基于的数据集对应的论文,相关引用信息列举在[这里](refs/ref.bib)。
+| ----------------- | :---------: | :----: | :-------------: | :-------------: |
+| GPT-3.5-Turbo-16k |    23.6     |  43.3  |      52.3       |      61.2       |
+| Llama2-7B-chat-4k |    19.1     |  19.6  |      35.8       |      11.6       |
+| LongChat-7B-16k   |    21.6     |  21.6  |      44.6       |      26.6       |
+| XGen-7B-8k        |    17.9     |  18.3  |      37.2       |      14.2       |
+| InternLM-7B-8k    |    12.4     |  16.8  |      22.3       |      33.3       |
+| ChatGLM2-6B       |    11.2     |  23.7  |      34.2       |      33.0       |
+| ChatGLM2-6B-32k   |    20.4     |  32.2  |      45.7       |      52.0       |
+
+#### Multi-Document QA
+
+|                   | HotpotQA | 2WikiMQA | Musique | DuReader (zh) |
+| ----------------- | :------: | :------: | :-----: | :-----------: |
+| GPT-3.5-Turbo-16k |   51.6   |   37.7   |  26.9   |     28.7      |
+| Llama2-7B-chat-4k |   24.3   |   31.4   |   8.6   |      1.9      |
+| LongChat-7B-16k   |   22.4   |   16.8   |   9.1   |     19.1      |
+| XGen-7B-8k        |   28.3   |   21.5   |  11.5   |      9.1      |
+| InternLM-7B-8k    |   27.9   |   24.0   |  10.3   |      8.9      |
+| ChatGLM2-6B       |   20.2   |   19.6   |   5.3   |     15.2      |
+| ChatGLM2-6B-32k   |   44.9   |   34.9   |  22.2   |     34.3      |
+
+#### Summarization
+
+|                   | GovReport | QMSum | VCSUM (zh) |
+| :---------------- | :-------: | :---: | :--------: |
+| GPT-3.5-Turbo-16k |   29.5    | 23.4  |    16.0    |
+| Llama2-7B-chat-4k |   27.3    | 20.6  |    0.2     |
+| LongChat-7B-16k   |   28.4    | 23.2  |    14.0    |
+| XGen-7B-8k        |   27.8    | 21.7  |    1.5     |
+| InternLM-7B-8k    |    9.8    | 16.8  |    13.0    |
+| ChatGLM2-6B       |   23.7    | 22.2  |    14.6    |
+| ChatGLM2-6B-32k   |   33.3    | 23.9  |    16.3    |
+
+#### Few-shot Learning
+
+|                   | TREC |  NQ  | TriviaQA | LSHT (zh) |
+| ----------------- | :--: | :--: | :------: | :-------: |
+| GPT-3.5-Turbo-16k | 68.0 | 73.0 |   87.1   |   29.2    |
+| Llama2-7B-chat-4k | 60.5 | 31.4 |   59.7   |   19.8    |
+| LongChat-7B-16k   | 61.5 | 44.8 |   73.5   |   20.8    |
+| XGen-7B-8k        | 66.0 | 43.2 |   67.0   |   20.0    |
+| InternLM-7B-8k    | 49.0 | 47.6 |   61.6   |   15.5    |
+| ChatGLM2-6B       | 44.0 | 34.5 |   59.8   |   20.5    |
+| ChatGLM2-6B-32k   | 62.0 | 64.9 |   77.6   |   29.9    |
+
+#### Code Completion
+
+|                   | LCC  | RepoBench-P |
+| ----------------- | :--: | :---------: |
+| GPT-3.5-Turbo-16k | 54.7 |    54.3     |
+| Llama2-7B-chat-4k | 52.3 |    42.4     |
+| LongChat-7B-16k   | 59.2 |    54.7     |
+| XGen-7B-8k        | 38.8 |    37.3     |
+| InternLM-7B-8k    | 45.5 |    34.0     |
+| ChatGLM2-6B       | 48.4 |    43.7     |
+| ChatGLM2-6B-32k   | 55.4 |    50.0     |
+
+#### Synthetic Tasks
+
+|                   | PassageRetrieval-en | Passage Count | PassageRetrieval-zh |
+| ----------------- | :-----------------: | :-----------: | :-----------------: |
+| GPT-3.5-Turbo-16k |        71.0         |      4.5      |        77.5         |
+| Llama2-7B-chat-4k |         9.2         |      2.5      |         0.5         |
+| LongChat-7B-16k   |        24.0         |      4.5      |         4.8         |
+| XGen-7B-8k        |         9.0         |      2.2      |         4.2         |
+| InternLM-7B-8k    |         6.5         |      2.9      |         0.9         |
+| ChatGLM2-6B       |         3.2         |      2.1      |         5.5         |
+| ChatGLM2-6B-32k   |        77.5         |      2.0      |        62.5         |
+
+## Acknowledgements
+
+- Some of the tasks of **LongBench** are based on the datasets proposed by previous researchers, including [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), [DuReader](https://github.com/baidu/DuReader), [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), [QMSum](https://arxiv.org/pdf/2104.05938.pdf), [VCSUM](https://arxiv.org/abs/2305.05280), [TriviaQA](https://nlp.cs.washington.edu/triviaqa/), [NQ](https://ai.google.com/research/NaturalQuestions/), [TREC](https://aclanthology.org/C02-1150.pdf), [LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf), [LCC](https://arxiv.org/abs/2306.14893) and [RepoBench-P](https://arxiv.org/abs/2306.03091).
+
+## Citation
+This work is jointly completed by **THUKEG** and **Zhipu AI**. The related paper is in preparation, and the citation information will be updated once it is ready; please stay tuned.
+
+If you use this benchmark, you can also cite the papers corresponding to the datasets that LongBench is based on. The relevant citation information is listed [here](refs/ref.bib).
\ No newline at end of file
diff --git a/README_EN.md b/README_EN.md
deleted file mode 100644
index 629c756..0000000
--- a/README_EN.md
+++ /dev/null
@@ -1,187 +0,0 @@
-![](misc/logo.gif)
-<p align="center">
-    🤗 <a href="https://huggingface.co/THUDM/chatglm-6b" target="_blank">HF Repo</a> • 📃 Paper coming soon!
-</p>
-
-阅读 [中文版本](README.md)
-
-# LongBench: A Multilingual, Multitask Benchmark Tailored for Long Context Understanding
-
-**LongBench** is the first comprehensive dataset for multi-language, multi-task, and comprehensive assessment of **long text understanding** capabilities of large language models. In the context of the widespread attention to the multi-language capabilities of large models, LongBench includes different languages (Chinese and English) to provide a more comprehensive evaluation of the large models' multi-language capabilities in long texts. In addition, LongBench consists of twenty different tasks, covering key long-text application scenarios such as single-document QA, multi-document QA, summaries, few-shot learning, code completion, and synthesis tasks.
-
-We are fully aware of the potentially high costs involved in the model evaluation process, especially in the context of long-text scenarios (such as manual annotation costs or API call costs). Therefore, we have adopted a fully automated evaluation method, aimed at measuring and evaluating the model's ability to understand long texts at the lowest cost and most effectively.
-
-LongBench includes 13 English tasks, 5 Chinese tasks, and 2 code tasks, with the average length of most tasks ranging from 5k to 15k. From the main task categories, LongBench includes six types of tasks, namely multi-document QA, single-document QA, summaries, Few-shot learning, synthetic tasks, and code completion. For detailed statistics and construction methods of LongBench tasks, please refer [here](task_en.md).
-
-| Task Type | \#English Task | \#Chinese Task | \#Code Task |
-| :-------: | :--------------------: | :--------------------: | :------------------: |
-| Multi-document QA | 3 | 1 | - |
-| Single-document QA | 3 | 1 | - |
-| Summarization | 2 | 1 | - |
-| Few-shot learning | 3 | 1 | - |
-| Synthetic Tasks | 2 | 1 | - |
-| Code Completion | - | - | 2 |
-
-## Leaderboard
-Here is the average score (%) of all models on various major tasks in both Chinese and English languages under the Zero-shot scenario. Please refer to this [link](task_en.md) for the evaluation metrics used for each task.
-
-#### English
-|                   | Avg  | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Tasks |
-| ----------------- | :--: | :-----------: | :----------: | :-----------: | :---------------: | :-------------: | :-------------: |
-| GPT-3.5-Turbo-16k | 45.5 |     39.8      |     38.7     |     26.5      |       76.0        |      54.5       |      37.8       |
-| Llama2-7B-chat-4k | 29.0 |     24.8      |     21.4     |     23.9      |       50.5        |      47.3       |       5.9       |
-| LongChat-7B-16k   | 33.7 |     29.3      |     16.1     |     25.8      |       59.9        |      57.0       |      14.2       |
-| XGen-7B-8k        | 28.7 |     24.5      |     20.4     |     24.8      |       58.7        |      38.0       |       5.6       |
-| InternLM-7B-8k    | 24.7 |     17.1      |     20.8     |     13.3      |       52.7        |      39.7       |       4.7       |
-| ChatGLM2-6B       | 26.0 |     23.1      |     15.0     |     22.9      |       46.1        |      46.1       |       2.7       |
-| ChatGLM2-6B-32k   | 42.7 |     32.8      |     34.0     |     28.6      |       68.1        |      52.7       |      39.8       |
-
-#### Chinese
-
-|                   | Avg  | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Tasks |
-| ----------------- | :--: | :-----------: | :----------: | :-----------: | :---------------: | :-------------: | :-------------: |
-| GPT-3.5-Turbo-16k | 44.5 |     61.2      |     28.7     |     16.0      |       29.2        |      54.5       |      77.5       |
-| Llama2-7B-chat-4k | 13.5 |     11.6      |     1.9      |      0.2      |       19.8        |      47.3       |       0.5       |
-| LongChat-7B-16k   | 23.7 |     26.6      |     19.1     |     14.0      |       20.8        |      57.0       |       4.8       |
-| XGen-7B-8k        | 14.5 |     14.2      |     9.1      |      1.5      |       20.0        |      38.0       |       4.2       |
-| InternLM-7B-8k    | 18.6 |     33.3      |     8.9      |     13.0      |       15.5        |      39.7       |       0.9       |
-| ChatGLM2-6B       | 22.5 |     33.0      |     15.2     |     14.6      |       20.5        |      46.1       |       5.5       |
-| ChatGLM2-6B-32k   | 41.3 |     52.0      |     34.3     |     16.3      |       29.9        |      52.7       |      62.5       |
-
-#### Radar Chart of Long Text Task Capability 
-
-![](misc/radar.png)
-
-#### Variation of Abilities under Different Text Lengths
-To more specifically analyze the model's relative performance under different text lengths, the following chart shows the average relative scores on all tasks over different text length intervals.
-![](misc/curve.png)
-
-> Note: Assume that the model scores x on the data within a specific length range of a task, and y on all data of that task, then the model's **relative score** for that length range is (x/y-1). To better compare the trends of different models, we shift all the lines to 0 from 0-4k.
-
-## How to evaluate models on LongBench
-
-#### Loading Data
-You can download and load the **LongBench** data through the Hugging Face datasets ([🤗 HF Repo](https://huggingface.co/datasets/THUDM/LongBench)):
-```python
-from datasets import load_dataset
-
-datasets = ["hotpotqa", "2wikimqa", "musique", "dureader", "narrativeqa", "qasper", "multifieldqa_en", \
-    "multifieldqa_zh", "gov_report", "qmsum", "vcsum", "trec", "nq", "triviaqa", "lsht", "passage_count", \
-    "passage_retrieval_en", "passage_retrieval_zh", "lcc", "repobench-p"]
-
-for dataset in datasets:
-    data = load_dataset('THUDM/LongBench', dataset, split='test')
-```
-Alternatively, you can download the folder from [this link](https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip) to load the data.
-
-#### Data Format
-
-All data in **LongBench** are standardized to the following format:
-
-```json
-{
-    "input": "The input/command for the task, usually short, such as questions in QA, queries in Few-shot tasks, etc.",
-    "context": "The long context text required for the task, such as documents, cross-file code, few-shot samples in Few-shot tasks",
-    "answers": "List composed of all standard answers",
-    "length": "Total length of the first three items of text (counted in characters for Chinese and words for English)",
-    "dataset": "The name of the dataset to which this piece of data belongs",
-    "language": "The language of this piece of data",
-    "all_classes": "All categories in classification tasks, null for non-classification tasks",
-    "_id": "Random id for each piece of data"
-}
-```
-
-#### Evaluation
-We provide an evaluation code using ChatGLM2-6B as an example. Firstly, run the [pred.py](pred.py) under the repository:
-```bash
-CUDA_VISIBLE_DEVICES=0 python pred.py
-```
-You can get the model outputs on all datasets in the `pred/` folder. After that, run the evaluation code of [eval.py](eval.py):
-```bash
-python eval.py
-```
-You can get the evaluation results on various datasets in `result.json`. Please note that we provide the input format suitable for each dataset and the maximum output length limit we summarized in `config/`. You can modify them during the evaluation to better suit the model you want to evaluate. After modification, when evaluating with [pred.py](pred.py), the data will be automatically organized according to the new format to get the corresponding model output.
-
-## Evaluation Result on Each Dataset
-
-The following tables display the Zero-shot evaluation results (%) of the model on all subtask datasets, where Chinese datasets are denoted by "zh" (please refer to this [link](task_en.md) for the evaluation metrics used for each task).
-
-#### Single-Document QA
-|                   | NarrativeQA | Qasper | MultiFieldQA-en | MultiFieldQA-zh |
-| ----------------- | :---------: | :----: | :-------------: | :-------------: |
-| GPT-3.5-Turbo-16k |    23.6     |  43.3  |      52.3       |      61.2       |
-| Llama2-7B-chat-4k |    19.1     |  19.6  |      35.8       |      11.6       |
-| LongChat-7B-16k   |    21.6     |  21.6  |      44.6       |      26.6       |
-| XGen-7B-8k        |    17.9     |  18.3  |      37.2       |      14.2       |
-| InternLM-7B-8k    |    12.4     |  16.8  |      22.3       |      33.3       |
-| ChatGLM2-6B       |    11.2     |  23.7  |      34.2       |      33.0       |
-| ChatGLM2-6B-32k   |    20.4     |  32.2  |      45.7       |      52.0       |
-
-#### Multi-Document QA
-
-|                   | HotpotQA | 2WikiMQA | Musique | DuReader (zh) |
-| ----------------- | :------: | :------: | :-----: | :-----------: |
-| GPT-3.5-Turbo-16k |   51.6   |   37.7   |  26.9   |     28.7      |
-| Llama2-7B-chat-4k |   24.3   |   31.4   |   8.6   |      1.9      |
-| LongChat-7B-16k   |   22.4   |   16.8   |   9.1   |     19.1      |
-| XGen-7B-8k        |   28.3   |   21.5   |  11.5   |      9.1      |
-| InternLM-7B-8k    |   27.9   |   24.0   |  10.3   |      8.9      |
-| ChatGLM2-6B       |   20.2   |   19.6   |   5.3   |     15.2      |
-| ChatGLM2-6B-32k   |   44.9   |   34.9   |  22.2   |     34.3      |
-
-#### Summarization
-
-|                   | GovReport | QMSum | VCSUM (zh) |
-| :---------------- | :-------: | :---: | :--------: |
-| GPT-3.5-Turbo-16k |   29.5    | 23.4  |    16.0    |
-| Llama2-7B-chat-4k |   27.3    | 20.6  |    0.2     |
-| LongChat-7B-16k   |   28.4    | 23.2  |    14.0    |
-| XGen-7B-8k        |   27.8    | 21.7  |    1.5     |
-| InternLM-7B-8k    |    9.8    | 16.8  |    13.0    |
-| ChatGLM2-6B       |   23.7    | 22.2  |    14.6    |
-| ChatGLM2-6B-32k   |   33.3    | 23.9  |    16.3    |
-
-#### Few-shot Learning
-
-|                   | TREC |  NQ  | TriviaQA | LSHT (zh) |
-| ----------------- | :--: | :--: | :------: | :-------: |
-| GPT-3.5-Turbo-16k | 68.0 | 73.0 |   87.1   |   29.2    |
-| Llama2-7B-chat-4k | 60.5 | 31.4 |   59.7   |   19.8    |
-| LongChat-7B-16k   | 61.5 | 44.8 |   73.5   |   20.8    |
-| XGen-7B-8k        | 66.0 | 43.2 |   67.0   |   20.0    |
-| InternLM-7B-8k    | 49.0 | 47.6 |   61.6   |   15.5    |
-| ChatGLM2-6B       | 44.0 | 34.5 |   59.8   |   20.5    |
-| ChatGLM2-6B-32k   | 62.0 | 64.9 |   77.6   |   29.9    |
-
-#### Code Completion
-
-|                   | LCC  | RepoBench-P |
-| ----------------- | :--: | :---------: |
-| GPT-3.5-Turbo-16k | 54.7 |    54.3     |
-| Llama2-7B-chat-4k | 52.3 |    42.4     |
-| LongChat-7B-16k   | 59.2 |    54.7     |
-| XGen-7B-8k        | 38.8 |    37.3     |
-| InternLM-7B-8k    | 45.5 |    34.0     |
-| ChatGLM2-6B       | 48.4 |    43.7     |
-| ChatGLM2-6B-32k   | 55.4 |    50.0     |
-
-#### Synthetic Tasks
-
-|                   | PassageRetrieval-en | Passage Count | PassageRetrieval-zh |
-| ----------------- | :-----------------: | :-----------: | :-----------------: |
-| GPT-3.5-Turbo-16k |        71.0         |      4.5      |        77.5         |
-| Llama2-7B-chat-4k |         9.2         |      2.5      |         0.5         |
-| LongChat-7B-16k   |        24.0         |      4.5      |         4.8         |
-| XGen-7B-8k        |         9.0         |      2.2      |         4.2         |
-| InternLM-7B-8k    |         6.5         |      2.9      |         0.9         |
-| ChatGLM2-6B       |         3.2         |      2.1      |         5.5         |
-| ChatGLM2-6B-32k   |        77.5         |      2.0      |        62.5         |
-
-## Acknowledgements
-
-- Some of the tasks of **LongBench** are based on the datasets proposed by previous researchers, including [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), [DuReader](https://github.com/baidu/DuReader), [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), [QMSum](https://arxiv.org/pdf/2104.05938.pdf), [VCSUM](https://arxiv.org/abs/2305.05280), [TriviaQA](https://nlp.cs.washington.edu/triviaqa/), [NQ](https://ai.google.com/research/NaturalQuestions/), [TREC](https://aclanthology.org/C02-1150.pdf), [LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf), [LCC](https://arxiv.org/abs/2306.14893) and [RepoBench-P](https://arxiv.org/abs/2306.03091).
-
-## Citation
-This work is jointly completed by **THUKEG** and **Zhipu AI**. The related paper is currently being written, and the citation information will be updated when it's ready. Please stay tuned~
-
-If you use this benchmark, you can also cite the papers corresponding to the datasets that LongBench is based on. The relevant citation information is listed [here](refs/ref.bib).
\ No newline at end of file
diff --git a/README_ZH.md b/README_ZH.md
new file mode 100644
index 0000000..3eff75c
--- /dev/null
+++ b/README_ZH.md
@@ -0,0 +1,175 @@
+![](misc/logo.gif)
+<p align="center">
+    🤗 <a href="https://huggingface.co/THUDM/chatglm-6b" target="_blank">HF Repo</a> • 📃 Paper coming soon!
+</p>
+
+Read this in [English](README.md).
+
+# LongBench: 多语言、多任务长文本理解评测基准
+
+**LongBench**是第一个多语言、多任务、全面评估大语言模型**长文本理解**能力的综合数据集。在目前大模型多语言能力引起广泛关注的背景下,LongBench涵盖了不同的语言(中文和英文),以此来对大模型在长文本下的多语言能力进行更全面的评估。同时,LongBench由二十个不同的任务组成,覆盖了单文档QA、多文档QA、摘要、Few-shot学习、代码补全和合成任务等关键的长文本应用场景。
+
+我们深知模型评测过程中可能产生的高昂成本,尤其是长文本场景下(如人工标注成本或API调用成本)。因此,我们采用了一种全自动的评测方式,旨在以最低的成本,最有效地衡量和评估模型的长文本理解能力。
+
+LongBench包含13个英文任务、5个中文任务和2个代码任务,多数任务的平均长度在5k-15k之间,共包含约4500条测试数据。从主要任务分类上,LongBench包含六大类任务,即单文档QA、多文档QA、摘要、Few-shot学习、代码补全和合成任务。关于LongBench数据集的具体统计及任务构造方式请参考[这里](task_zh.md)。
+
+|   任务类型   | 英文任务数 | 中文任务数 | 代码任务数 |
+| :----------: | :--------: | :--------: | :--------: |
+|   单文档QA   |     3      |     1      |     -      |
+|   多文档QA   |     3      |     1      |     -      |
+|     摘要     |     2      |     1      |     -      |
+| Few-shot学习 |     3      |     1      |     -      |
+|   合成任务   |     2      |     1      |     -      |
+|   代码补全   |     -      |     -      |     2      |
+
+## 排行榜
+我们在这里展示了所有模型在Zero-shot场景下,在中文和英文各大类任务上得分的平均值(%),各任务评估所用指标请参考[这里](task_zh.md)。
+#### 英文榜单
+|     | Avg | 单文档QA | 多文档QA | 摘要 | Few-shot学习 | 代码补全 | 合成任务 |
+| --- | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
+| GPT-3.5-Turbo-16k | 45.5 | 39.8 | 38.7 | 26.5 | 76.0 | 54.5 | 37.8 |
+| Llama2-7B-chat-4k | 29.0 | 24.8 | 21.4 | 23.9 | 50.5 | 47.3 | 5.9 |
+| LongChat-7B-16k | 33.7 | 29.3 | 16.1 | 25.8 | 59.9 | 57.0 | 14.2 |
+| XGen-7B-8k | 28.7 | 24.5 | 20.4 | 24.8 | 58.7 | 38.0 | 5.6 |
+| InternLM-7B-8k | 24.7 | 17.1 | 20.8 | 13.3 | 52.7 | 39.7 | 4.7 |
+| ChatGLM2-6B | 26.0 | 23.1 | 15.0 | 22.9 | 46.1 | 46.1 | 2.7 |
+| ChatGLM2-6B-32k | 42.7 | 32.8 | 34.0 | 28.6 | 68.1 | 52.7 | 39.8 |
+
+#### 中文榜单
+|       | Avg | 单文档QA | 多文档QA | 摘要 | Few-shot学习 | 代码补全 | 合成任务 |
+|-------|:---:|:-------------:|:------------:|:-------------:|:-----------------:|:---------------:|:----------------:|
+| GPT-3.5-Turbo-16k | 44.5 | 61.2 | 28.7 | 16.0 | 29.2 | 54.5 | 77.5 |
+| Llama2-7B-chat-4k | 13.5 | 11.6 | 1.9 | 0.2 | 19.8 | 47.3 | 0.5 |
+| LongChat-7B-16k | 23.7 | 26.6 | 19.1 | 14.0 | 20.8 | 57.0 | 4.8 |
+| XGen-7B-8k | 14.5 | 14.2 | 9.1 | 1.5 | 20.0 | 38.0 | 4.2 |
+| InternLM-7B-8k | 18.6 | 33.3 | 8.9 | 13.0 | 15.5 | 39.7 | 0.9 |
+| ChatGLM2-6B | 22.5 | 33.0 | 15.2 | 14.6 | 20.5 | 46.1 | 5.5 |
+| ChatGLM2-6B-32k | 41.3 | 52.0 | 34.3 | 16.3 | 29.9 | 52.7 | 62.5 |
+
+#### 长文本任务能力雷达图
+![](misc/radar.png)
+
+#### 不同长度文本下的能力变化
+为了更有针对性地分析模型在不同文本长度下的相对表现,下图展示了模型在不同文本长度区间上,所有任务上的平均相对分数。
+![](misc/curve.png)
+
+> 注:假设模型在某个任务的特定长度范围内数据上得分为x,在该任务所有数据上得分为y,则模型在该长度范围的**相对分数**为(x/y-1)。为了更好比较不同模型的变化趋势,我们在0-4k将所有折线平移至0。
+
+## 如何在LongBench上评测模型
+
+#### 载入数据
+你可以通过Hugging Face datasets来下载并载入**LongBench**的数据([🤗 HF Repo](https://huggingface.co/datasets/THUDM/LongBench)):
+```python
+from datasets import load_dataset
+
+datasets = ["hotpotqa", "2wikimqa", "musique", "dureader", "narrativeqa", "qasper", "multifieldqa_en", \
+    "multifieldqa_zh", "gov_report", "qmsum", "vcsum", "trec", "nq", "triviaqa", "lsht", "passage_count", \
+    "passage_retrieval_en", "passage_retrieval_zh", "lcc", "repobench-p"]
+
+for dataset in datasets:
+    data = load_dataset('THUDM/LongBench', dataset, split='test')
+```
+同样地,你也可以直接用这个[链接](https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip)下载所有的评测数据。
+
+#### 数据格式
+**LongBench**中所有数据都统一为以下格式:
+```json
+{
+    "input": "任务的输入/指令,通常较短,比如QA中的问题、Few-shot任务中的提问等",
+    "context": "任务所需的长语境文本,比如文档、跨文件代码、Few-shot任务中的few-shot样本",
+    "answers": "由所有标准答案组成的列表",
+    "length": "前三项文本的总长度(中、英文分别用字、词数统计)",
+    "dataset": "本条数据所属数据集名称",
+    "language": "本条数据的语言",
+    "all_classes": "分类任务中的所有类别,非分类任务则为null",
+    "_id": "每条数据的随机id"
+}
+```
+
+#### 评测
+我们以ChatGLM2-6B为例提供了一份评测代码。首先,运行仓库下的[pred.py](pred.py)
+```bash
+CUDA_VISIBLE_DEVICES=0 python pred.py
+```
+可以在`pred/`文件夹下得到模型在所有数据集下的输出,此后运行[eval.py](eval.py)的评测代码:
+```bash
+python eval.py
+```
+可以在`result.json`中得到在各数据集上的评测结果。请注意,我们在`config/`下提供了我们总结出来的在各数据集上适合的输入格式和最大输出长度限制,在评测的时候可以进行修改以更好地适用你要评测的模型,修改后在[pred.py](pred.py)评测时会自动按照新的格式去整理数据并得到对应的模型输出。
+
+## 详细评测结果
+下面的几张表格展示了模型在所有子任务数据集上的Zero-shot评测结果(%),其中的中文数据集以“zh”标示(各任务评估所用指标请参考[这里](task_zh.md))。
+> 注:对于超出模型处理长度能力的文本,参考[Lost in the Middle](https://arxiv.org/abs/2307.03172)的观察,我们从文本中间进行截断,保持前后部分的信息。实验表明,这种截断方式对模型性能影响最小。
+#### 单文档QA
+|                   | NarrativeQA | Qasper | MultiFieldQA-en | MultiFieldQA-zh |
+|-------------------|:-----------:|:------:|:---------------:|:---------------:|
+| GPT-3.5-Turbo-16k | 23.6 | 43.3 | 52.3 | 61.2 |
+| Llama2-7B-chat-4k | 19.1 | 19.6 | 35.8 | 11.6 |
+| LongChat-7B-16k | 21.6 | 21.6 | 44.6 | 26.6 |
+| XGen-7B-8k | 17.9 | 18.3 | 37.2 | 14.2 |
+| InternLM-7B-8k | 12.4 | 16.8 | 22.3 | 33.3 |
+| ChatGLM2-6B | 11.2 | 23.7 | 34.2 | 33.0 |
+| ChatGLM2-6B-32k | 20.4 | 32.2 | 45.7 | 52.0 |
+
+#### 多文档QA
+|                      | HotpotQA | 2WikiMQA | Musique | DuReader (zh) |
+|----------------------|:--------:|:--------:|:-------:|:--------:|
+| GPT-3.5-Turbo-16k | 51.6 | 37.7 | 26.9 | 28.7 |
+| Llama2-7B-chat-4k | 24.3 | 31.4 | 8.6 | 1.9 |
+| LongChat-7B-16k | 22.4 | 16.8 | 9.1 | 19.1 |
+| XGen-7B-8k | 28.3 | 21.5 | 11.5 | 9.1 |
+| InternLM-7B-8k | 27.9 | 24.0 | 10.3 | 8.9 |
+| ChatGLM2-6B | 20.2 | 19.6 | 5.3 | 15.2 |
+| ChatGLM2-6B-32k | 44.9 | 34.9 | 22.2 | 34.3 |
+
+#### 摘要
+|            | GovReport | QMSum | VCSUM (zh) |
+|:-----------|:---------:|:-----:|:-----:|
+| GPT-3.5-Turbo-16k | 29.5 | 23.4 | 16.0 |
+| Llama2-7B-chat-4k | 27.3 | 20.6 | 0.2 |
+| LongChat-7B-16k | 28.4 | 23.2 | 14.0 |
+| XGen-7B-8k | 27.8 | 21.7 | 1.5 |
+| InternLM-7B-8k | 9.8 | 16.8 | 13.0 |
+| ChatGLM2-6B | 23.7 | 22.2 | 14.6 |
+| ChatGLM2-6B-32k | 33.3 | 23.9 | 16.3 |
+
+#### Few-shot学习
+|     | TREC | NQ | TriviaQA | LSHT (zh) |
+| --- | :-: | :-: | :-: | :-: |
+| GPT-3.5-Turbo-16k | 68.0 | 73.0 | 87.1 | 29.2 |
+| Llama2-7B-chat-4k | 60.5 | 31.4 | 59.7 | 19.8 |
+| LongChat-7B-16k | 61.5 | 44.8 | 73.5 | 20.8 |
+| XGen-7B-8k | 66.0 | 43.2 | 67.0 | 20.0 |
+| InternLM-7B-8k | 49.0 | 47.6 | 61.6 | 15.5 |
+| ChatGLM2-6B | 44.0 | 34.5 | 59.8 | 20.5 |
+| ChatGLM2-6B-32k | 62.0 | 64.9 | 77.6 | 29.9 |
+
+#### 代码补全
+|     | LCC | RepoBench-P |
+| --- | :-: | :-: |
+| GPT-3.5-Turbo-16k | 54.7 | 54.3 |
+| Llama2-7B-chat-4k | 52.3 | 42.4 |
+| LongChat-7B-16k | 59.2 | 54.7 |
+| XGen-7B-8k | 38.8 | 37.3 |
+| InternLM-7B-8k | 45.5 | 34.0 |
+| ChatGLM2-6B | 48.4 | 43.7 |
+| ChatGLM2-6B-32k | 55.4 | 50.0 |
+
+#### 合成任务
+|     | PassageRetrieval-en | Passage Count | PassageRetrieval-zh |
+| --- | :-: | :-: | :-: |
+| GPT-3.5-Turbo-16k | 71.0 | 4.5 | 77.5 |
+| Llama2-7B-chat-4k | 9.2 | 2.5 | 0.5 |
+| LongChat-7B-16k | 24.0 | 4.5 | 4.8 |
+| XGen-7B-8k | 9.0 | 2.2 | 4.2 |
+| InternLM-7B-8k | 6.5 | 2.9 | 0.9 |
+| ChatGLM2-6B | 3.2 | 2.1 | 5.5 |
+| ChatGLM2-6B-32k | 77.5 | 2.0 | 62.5 |
+
+## 致谢
+- **LongBench**的部分任务基于之前的研究者提出的数据集构建,包括[HotpotQA](https://hotpotqa.github.io/),[2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/),[Musique](https://arxiv.org/abs/2108.00573),[DuReader](https://github.com/baidu/DuReader),[NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf),[Qasper](https://arxiv.org/pdf/2105.03011.pdf),[GovReport](https://arxiv.org/pdf/2104.02112.pdf),[QMSum](https://arxiv.org/pdf/2104.05938.pdf),[VCSUM](https://arxiv.org/abs/2305.05280),[TriviaQA](https://nlp.cs.washington.edu/triviaqa/),[NQ](https://ai.google.com/research/NaturalQuestions/),[TREC](https://aclanthology.org/C02-1150.pdf),[LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf),[LCC](https://arxiv.org/abs/2306.14893)和[RepoBench-P](https://arxiv.org/abs/2306.03091)。
+
+## 引用
+本工作由**THUKEG**和**Zhipu AI**共同完成,相关论文正在撰写中,届时将更新引用信息,敬请关注~
+
+如果您使用该Benchmark,可以一并引用LongBench所基于的数据集对应的论文,相关引用信息列举在[这里](refs/ref.bib)。
diff --git a/task.md b/task.md
index 67a6a0d..186164f 100644
--- a/task.md
+++ b/task.md
@@ -1,66 +1,67 @@
-# 任务统计
+# Task statistics
 
-| 任务              |      任务类型  |      评价指标  |     平均长度                                       |语言 | Sample数量|
-| :--------- | :-----------:| :-----------: |:---------: | :-------------: |:---------: |
-| HotpotQA   | 多文档QA       | F1                        |9,149                           |英文                           |200                           |
-| 2WikiMultihopQA| 多文档QA | F1                        |4,885                           |英文                           |200                           |
-| Musique| 多文档QA   | F1                        |7,798                           |英文                           |200                           |
-| DuReader| 多文档QA  | Rouge-L                 |15,768                           |中文                           |200                           |
-| MultiFieldQA-en| 单文档QA | F1                        |4,559                           |英文                           |150                           |
-| MultiFieldQA-zh| 单文档QA | F1                        |6,771                           |中文                           |200                           |
-| NarrativeQA| 单文档QA | F1                        |18,405                           |英文                           |200                           |
-| Qasper| 单文档QA    | F1                        |3,619                           |英文                           |200                           |
-| GovReport| 摘要 | Rouge-L                 |8,169                           |英文                           |200                           |
-| QMSum| 摘要     | Rouge-L                 |10,546                           |英文                           |200                           |
-| VCSUM| 摘要     | Rouge-L                 |15,147                           |中文                           |200                           |
-| TriviaQA| Few shot  | F1                        |8,015                           |英文                           |200                           |
-| NQ| Few shot | F1                        |8,210                           |英文                           |200                           |
-| TREC| Few shot | Accuracy                |5,176                           |英文                           |200                           |
-| LSHT| Few shot | Accuracy                |22,333                           |中文                           |200                           |
-| PassageRetrieval-en| 合成任务 | Accuracy                |9,288                           |英文                           |200                           |
-| PassageCount| 合成任务 | Accuracy                |11,141                           |英文                           |200                           |
-| PassageRetrieval-zh | 合成任务 | Accuracy                |6,745                           |中文                           |200                           |
-| LCC| 代码 | Edit Sim              |1,235                           |Python/C#/Java                           |500                           |
-| RepoBench-P| 代码 | Edit Sim                |5,622                           |Python/Java                           |500                           |
+| Task          | Task Type | Eval metric |     Avg len                            |Language | \#Sample |
+| --------- | -------------| ------------- |--------- | ------------- |--------- |
+| HotpotQA   | Multi-doc QA | F1                        |9149                           |EN                           |200                           |
+| 2WikiMultihopQA| Multi-doc QA | F1                        |4885                           |EN                           |200                           |
+| Musique| Multi-doc QA | F1                        |7798                           |EN                           |200                           |
+| DuReader| Multi-doc QA | Rouge-L                 |15768                           |ZH                           |200                           |
+| MultiFieldQA-en| Single-doc QA | F1                        |4559                           |EN                           |150                           |
+| MultiFieldQA-zh| Single-doc QA | F1                        |6771                           |ZH                           |200                           |
+| NarrativeQA| Single-doc QA | F1                        |18405                           |EN                           |200                           |
+| Qasper| Single-doc QA | F1                        |3619                           |EN                           |200                           |
+| GovReport| Summarization | Rouge-L                 |8169                           |EN                           |200                           |
+| QMSum| Summarization | Rouge-L                 |10546                           |EN                           |200                           |
+| VCSUM| Summarization | Rouge-L                 |15147                           |ZH                           |200                           |
+| TriviaQA| Few shot  | F1                        |8015                           |EN                           |200                           |
+| NQ| Few shot | F1                        |8210                           |EN                           |200                           |
+| TREC| Few shot | Accuracy                |5176                           |EN                           |200                           |
+| LSHT| Few shot | Accuracy                |22333                           |ZH                           |200                           |
+| PassageRetrieval-en| Synthetic | Accuracy                |9288                           |EN                           |200                           |
+| PassageCount| Synthetic | Accuracy                |11141                           |EN                           |200  |
+| PassageRetrieval-zh | Synthetic | Accuracy                |6745                           |ZH                           |200                           |
+| LCC| Code | Edit Sim              |1235                           |Python/C#/Java                           |500                           |
+| RepoBench-P| Code | Edit Sim                |5622                           |Python/Java                           |500                           |
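+
+For reference, the F1 metric used for the QA tasks above is typically computed at the token level between the model prediction and each gold answer, keeping the best match; a minimal sketch, not necessarily identical to the implementation in eval.py:
+```python
+from collections import Counter
+
+def token_f1(prediction: str, ground_truth: str) -> float:
+    """Token-level F1 between a prediction and one gold answer."""
+    pred_tokens, gold_tokens = prediction.split(), ground_truth.split()
+    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
+    if num_same == 0:
+        return 0.0
+    precision = num_same / len(pred_tokens)
+    recall = num_same / len(gold_tokens)
+    return 2 * precision * recall / (precision + recall)
+
+# Score against all reference answers and keep the maximum.
+score = max(token_f1("Arthur's Magazine", ans) for ans in ["Arthur's Magazine", "First for Women"])
+```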
 
-> 注:为了避免不同Tokenizer统计的差距,我们使用单词数(Python的split函数)来统计英文数据集和代码数据集的平均长度,使用汉字数来统计中文数据集的平均长度。
+> Note: In order to avoid discrepancies caused by different tokenizers, we use the word count (using Python's split function) to calculate the average length of English datasets and code datasets, and use the character count to calculate the average length of Chinese datasets.
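+
+A small sketch of the length convention described above (word count via Python's `split` for English and code, character count for Chinese):
+```python
+def sample_length(text: str, language: str) -> int:
+    # English and code: whitespace-separated word count; Chinese: character count.
+    return len(text) if language == "zh" else len(text.split())
+
+print(sample_length("How many words are in this sentence ?", "en"))  # 8
+print(sample_length("中文按字数统计", "zh"))  # 7
+```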
 
-# 任务说明
+# Task description
 
-| 任务              | 任务说明                                                     |
-| :----------------- | :----------------------------------------------------------- |
-| HotpotQA          | 基于多篇给定的文档,回答相关问题                             |
-| 2WikiMultihopQA   | 基于多篇给定的文档,回答相关问题                             |
-| Musique           | 基于多篇给定的文档,回答相关问题                             |
-| DuReader          | 基于多篇给定的检索文档,回答相关的中文问题                   |
-| MultiFieldQA-en   | 基于单篇文档,回答英文问题,文档所属的领域相对多元           |
-| MultiFieldQA-zh   | 基于单篇文档,回答中文问题,文档所属的领域相对多元           |
-| NarrativeQA       | 基于故事或剧本提问,包括对人物、情节、主题等重要元素的理解   |
-| Qasper            | 基于单篇论文的提出,问题由NLP的读者提出,并由NLP从业者回答   |
-| GovReport         | 摘要任务,要求对政府的工作报告进行总结摘要                   |
-| QMSum             | 摘要任务,要求基于用户的查询对会议记录进行摘要               |
-| VCSUM             | 摘要任务,要求对中文会议记录进行总结摘要                     |
-| TriviaQA          | 单文档问答任务,提供若干的Few Shot样例                       |
-| NQ                | 单文档问答任务,提供若干的Few Shot样例                       |
-| TREC              | 分类任务,要求对问题进行分类,一共包含50个类别               |
-| LSHT              | 中文分类任务,要求对新闻进行分类,一共包含24个类别           |
-| PassageRetrieval-en | 给定30个英文维基的段落,判断给定的摘要属于哪个段落           |
-| PassageCount | 判断给定的若干的段落中不重复的段落一共有几个           |
-| PassageRetrieval-zh | 给定若干个出自C4数据集的中文段落,判断给定的摘要属于哪个段落 |
-| LCC               | 给定一段较长代码,要求预测出下一行代码                       |
-| RepoBench-P       | 给定一个github仓库内多个文件中的代码(包含文件间依赖),要求预测出下一行代码 |
+| Task              | Task Description                                            |
+| ----------------- | ------------------------------------------------------------ |
+| HotpotQA          | Answer related questions based on multiple given documents   |
+| 2WikiMultihopQA   | Answer related questions based on multiple given documents   |
+| Musique           | Answer related questions based on multiple given documents   |
+| DuReader          | Answer related Chinese questions based on multiple retrieved documents |
+| MultiFieldQA-en   | Answer English questions based on a single document drawn from relatively diverse fields |
+| MultiFieldQA-zh   | Answer Chinese questions based on a single document drawn from relatively diverse fields |
+| NarrativeQA       | Ask questions based on stories or scripts, including understanding of important elements such as characters, plots, themes, etc. |
+| Qasper            | Question answering over a single NLP paper; questions are posed by NLP readers and answered by NLP practitioners |
+| GovReport         | A summarization task that requires summarizing government work reports |
+| QMSum             | A summarization task that requires summarizing meeting records based on user queries |
+| VCSUM             | A summarization task that requires summarizing Chinese meeting records |
+| TriviaQA          | Single document question answering task, providing several Few Shot examples |
+| NQ                | Single document question answering task, providing several Few Shot examples |
+| TREC              | A classification task that requires categorizing questions; 50 categories in total |
+| LSHT              | A Chinese classification task that requires categorizing news; 24 categories in total |
+| PassageRetrieval-en | Given 30 English Wikipedia paragraphs, determine which paragraph the given summary corresponds to |
+| PassageCount | Determine the number of unique paragraphs among a given set of paragraphs |
+| PassageRetrieval-zh | Given several Chinese paragraphs from the C4 dataset, determine which paragraph the given summary corresponds to |
+| LCC               | Given a longer piece of code, predict the next line of code |
+| RepoBench-P       | Given code in multiple files within a GitHub repository (including inter-file dependencies), predict the next line of code |
 
-# 数据构造方式
 
-> 注:对于所有基于已有数据集构造的任务,我们均选用原有数据集的验证集或测试集的数据(VCSUM任务除外)
+# Task construction
 
-- [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573)和[DuReader](https://github.com/baidu/DuReader)任务基于原始的数据集构建,并进行相关处理使其适用于长文本评测。具体地,对于验证集中的问题,我们会选取包含答案的evidence passage和若干干扰的文章,这些文章和原始的问题共同组成了相关任务的输入。
-- MultiFiedQA-zh和MultiFieldQA-en任务由约10种来源的长文本数据组成,包含Latex论文、裁判文书、政府工作报告和谷歌索引的PDF文档等。对于每篇长文本,我们邀请了若干博士生和硕士生来进行标注,即基于长文本提问,并给出正确的答案。为了更好地进行自动化评测,我们要求标注员尽可能提出有确定性答案的问题。
-- [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf)和[QMSum](https://arxiv.org/pdf/2104.05938.pdf)任务直接使用原论文提供的数据。在具体的构建中,我们使用[ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/)提供的模板来将对应的数据转换为纯文本的输入。
-- [VCSUM](https://arxiv.org/abs/2305.05280)任务基于原始的数据集构建,我们针对该数据设计了相应的模板将对应的数据转换为纯文本的输入。
-- [TriviaQA](https://nlp.cs.washington.edu/triviaqa/)和[NQ](https://ai.google.com/research/NaturalQuestions/)任务参考[CoLT5](https://arxiv.org/abs/2303.09752)的方式进行构建,即会提供若干基于文档进行问答的样例,并要求语言模型基于新的文档回答相关问题。
-- [TREC](https://aclanthology.org/C02-1150.pdf)和[LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf)任务基于原始的数据集构建。对于验证集中的每个问题,我们采样训练集中的若干数据组成Few-shot样例。这些样例会和验证集中的问题共同组成该任务的输入。
-- PassageRetrieval-en任务基于英文维基进行构造。对于每条数据,我们随机采样30段英文维基的段落,并选取其中一段进行摘要(使用GPT-3.5-Turbo)。该任务要求模型给出摘要应该对应哪个的原始段落。
-- PassageCount任务基于英文维基进行构造。对于每条数据,我们随机采样若干英文维基的段落,并将其中的每个段落随机重复若干次,最后将段落随机打乱。该任务要求模型判断给定的若干的段落中不重复的段落一共有几个。
-- PassageRetrieval-zh任务基于[C4](https://arxiv.org/abs/1910.10683)进行构造。对于每条数据,我们随机采样若干段来自于C4的中文段落,并选取其中一段进行摘要(使用GPT-3.5-Turbo)。该任务要求模型给出摘要应该对应哪个的原始段落。
-- [LCC](https://arxiv.org/abs/2306.14893)任务我们基于原始的代码补全数据集采样构建。[RepoBench-P](https://arxiv.org/abs/2306.03091)任务中我们选取了原数据集最具挑战性的XF-F(Cross-File-First)设定,并且参考原文中的Oracle-Filled场景,对于每一条原始数据我们随机抽取包括有效跨文件代码片段(gold snippet)在内的多个跨文件代码片段,将其拼接后作为输入,要求模型从其中利用有效的跨文件代码以补全当前文件中的代码。
\ No newline at end of file
+> Note: For all tasks constructed from existing datasets, we use data from the validation or test set of the existing dataset (except for VCSUM).
+
+- The tasks of [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) are built based on the original datasets and processed to make them suitable for long text evaluation. Specifically, for questions in the validation set, we select the evidence passage that contains the answer and several distracting articles. These articles together with the original question constitute the input of the related tasks.
+- The tasks of MultiFieldQA-zh and MultiFieldQA-en consist of long-text data from about 10 sources, including LaTeX papers, judicial documents, government work reports, and PDF documents indexed by Google. For each long text, we invite several Ph.D. and master's students to annotate it, i.e., to ask questions based on the long text and provide the correct answers. To make automated evaluation easier, we ask the annotators to pose questions with definitive answers whenever possible.
+- The tasks of [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), and [QMSum](https://arxiv.org/pdf/2104.05938.pdf) directly use the data provided by the original papers. In the specific construction, we use the template provided by [ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/) to convert the corresponding data into pure text input.
+- The [VCSUM](https://arxiv.org/abs/2305.05280) task is built based on the original dataset, and we have designed a corresponding template to convert the corresponding data into pure text input.
+- The tasks of [TriviaQA](https://nlp.cs.washington.edu/triviaqa/) and [NQ](https://ai.google.com/research/NaturalQuestions/) are constructed in the manner of [CoLT5](https://arxiv.org/abs/2303.09752): several examples of document-based question answering are provided, and the language model is required to answer related questions based on a new document.
+- The tasks of [TREC](https://aclanthology.org/C02-1150.pdf) and [LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf) are built based on the original datasets. For each question in the validation set, we sample several data from the training set to form few-shot examples. These examples together with the questions in the validation set constitute the input for this task.
+- The PassageRetrieval-en task is constructed from English Wikipedia. For each piece of data, we randomly sample 30 paragraphs from English Wikipedia and summarize one of them (using GPT-3.5-Turbo). The task requires the model to identify which original paragraph the summary corresponds to (see the synthetic-task sketch after this list).
+- The PassageCount task is also constructed from English Wikipedia. For each piece of data, we randomly sample several paragraphs, repeat each of them a random number of times, and finally shuffle the paragraphs. The task requires the model to determine how many unique paragraphs appear in the given input (see the synthetic-task sketch after this list).
+- The PassageRetrieval-zh task is constructed from [C4](https://arxiv.org/abs/1910.10683). For each piece of data, we randomly sample several Chinese paragraphs from C4 and summarize one of them (using GPT-3.5-Turbo). The task requires the model to identify which original paragraph the summary corresponds to.
+- For the [LCC](https://arxiv.org/abs/2306.14893) task, we sample from the original code completion dataset. For the [RepoBench-P](https://arxiv.org/abs/2306.03091) task, we select the most challenging XF-F (Cross-File-First) setting from the original dataset and follow its Oracle-Filled scenario: for each piece of data, we randomly extract multiple cross-file code snippets, including the gold cross-file snippet, concatenate them as the input, and require the model to use the relevant cross-file code to complete the code in the current file (see the last sketch after this list).
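+
+Below is a minimal, illustrative sketch of how a few-shot input of this kind (e.g., for TREC, LSHT, TriviaQA, or NQ) might be assembled. The function and field names (`build_few_shot_input`, `question`, `label`) are assumptions for illustration, not the actual construction script.
+
+```python
+import random
+
+def build_few_shot_input(train_set, eval_question, k=8, seed=42):
+    """Sample k labelled training examples and prepend them to the evaluation question."""
+    rng = random.Random(seed)
+    demos = rng.sample(train_set, k)
+    parts = [f"Question: {d['question']}\nType: {d['label']}" for d in demos]
+    parts.append(f"Question: {eval_question}\nType:")
+    return "\n\n".join(parts)
+```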
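+
+The synthetic retrieval and counting tasks above can be pictured with the following sketch. Here `summarize` stands in for the GPT-3.5-Turbo summarization step, and the function and field names are illustrative assumptions rather than the exact construction code.
+
+```python
+import random
+
+def build_passage_retrieval_example(passages, summarize, rng, n=30):
+    """Sample n paragraphs, summarize one of them, and ask which paragraph the summary matches."""
+    pool = rng.sample(passages, n)
+    target = rng.randrange(n)
+    context = "\n\n".join(f"Paragraph {i + 1}: {p}" for i, p in enumerate(pool))
+    return {"context": context, "input": summarize(pool[target]), "answer": f"Paragraph {target + 1}"}
+
+def build_passage_count_example(passages, rng):
+    """Sample a pool of paragraphs, repeat each a random number of times, shuffle, and keep the unique count as the answer."""
+    pool = rng.sample(passages, rng.randint(5, 10))
+    repeated = [p for p in pool for _ in range(rng.randint(1, 4))]
+    rng.shuffle(repeated)
+    return {"context": "\n\n".join(repeated), "answer": str(len(pool))}
+```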
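+
+For RepoBench-P, the input assembly can be sketched as follows; the function name and the number of distractor snippets are assumptions made purely for illustration.
+
+```python
+import random
+
+def build_repobench_input(gold_snippet, distractor_snippets, in_file_context, rng, k=8):
+    """Mix the gold cross-file snippet with k distractors, shuffle, and append the in-file code to be completed."""
+    snippets = rng.sample(distractor_snippets, k) + [gold_snippet]
+    rng.shuffle(snippets)
+    return "\n\n".join(snippets) + "\n\n" + in_file_context
+```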
\ No newline at end of file
diff --git a/task_en.md b/task_en.md
deleted file mode 100644
index 186164f..0000000
--- a/task_en.md
+++ /dev/null
@@ -1,67 +0,0 @@
-# Task statistics
-
-| Task          | Task Type | Eval metric |     Avg len                            |Language | \#Sample |
-| --------- | -------------| ------------- |--------- | ------------- |--------- |
-| HotpotQA   | Multi-doc QA | F1                        |9149                           |EN                           |200                           |
-| 2WikiMultihopQA| Multi-doc QA | F1                        |4885                           |EN                           |200                           |
-| Musique| Multi-doc QA | F1                        |7798                           |EN                           |200                           |
-| DuReader| Multi-doc QA | Rouge-L                 |15768                           |ZH                           |200                           |
-| MultiFieldQA-en| Single-doc QA | F1                        |4559                           |EN                           |150                           |
-| MultiFieldQA-zh| Single-doc QA | F1                        |6771                           |ZH                           |200                           |
-| NarrativeQA| Single-doc QA | F1                        |18405                           |EN                           |200                           |
-| Qasper| Single-doc QA | F1                        |3619                           |EN                           |200                           |
-| GovReport| Summarization | Rouge-L                 |8169                           |EN                           |200                           |
-| QMSum| Summarization | Rouge-L                 |10546                           |EN                           |200                           |
-| VCSUM| Summarization | Rouge-L                 |15147                           |ZH                           |200                           |
-| TriviaQA| Few shot  | F1                        |8015                           |EN                           |200                           |
-| NQ| Few shot | F1                        |8210                           |EN                           |200                           |
-| TREC| Few shot | Accuracy                |5176                           |EN                           |200                           |
-| LSHT| Few shot | Accuracy                |22333                           |ZH                           |200                           |
-| PassageRetrieval-en| Synthetic | Accuracy                |9288                           |EN                           |200                           |
-| PassageCount| Synthetic | Accuracy                |11141                           |EN                           |200  |
-| PassageRetrieval-zh | Synthetic | Accuracy                |6745                           |ZH                           |200                           |
-| LCC| Code | Edit Sim              |1235                           |Python/C#/Java                           |500                           |
-| RepoBench-P| Code | Edit Sim                |5622                           |Python/Java                           |500                           |
-
-> Note: In order to avoid discrepancies caused by different tokenizers, we use the word count (using Python's split function) to calculate the average length of English datasets and code datasets, and use the character count to calculate the average length of Chinese datasets.
-
-# Task description
-
-| Task              | Task Description                                            |
-| ----------------- | ------------------------------------------------------------ |
-| HotpotQA          | Answer related questions based on multiple given documents   |
-| 2WikiMultihopQA   | Answer related questions based on multiple given documents   |
-| Musique           | Answer related questions based on multiple given documents   |
-| DuReader          | Answer related Chinese questions based on multiple retrieved documents |
-| MultiFieldQA-en   | Answer English questions based on a single document, which comes from a relatively diverse field |
-| MultiFieldQA-zh   | Answer Chinese questions based on a single document, which comes from a relatively diverse field |
-| NarrativeQA       | Ask questions based on stories or scripts, including understanding of important elements such as characters, plots, themes, etc. |
-| Qasper            | Ask questions based on a single paper, questions proposed by NLP readers, and answered by NLP practitioners |
-| GovReport         | A summarization task that requires summarizing government work reports |
-| QMSum             | A summarization task that requires summarizing meeting records based on user queries |
-| VCSUM             | A summarization task that requires summarizing Chinese meeting records |
-| TriviaQA          | Single document question answering task, providing several Few Shot examples |
-| NQ                | Single document question answering task, providing several Few Shot examples |
-| TREC              | A classification task that requires categorizing questions, includes 50 categories in total |
-| LSHT              | A Chinese classification task that requires categorizing news, includes 24 categories in total |
-| PassageRetrieval-en | Given 30 English Wikipedia paragraphs, determine which paragraph the given summary belongs to |
-| PassageCount | Determine the number of non-repeating paragraphs in a given number of paragraphs |
-| PassageRetrieval-zh | Given several Chinese paragraphs from the C4 data set, determine which paragraph the given abstract belongs to |
-| LCC               | Given a longer piece of code, predict the next line of code |
-| RepoBench-P       | Given code in multiple files within a GitHub repository (including inter-file dependencies), predict the next line of code |
-
-
-# Task construction
-
-> Note: For all tasks constructed from existing datasets, we use data from the validation or test set of the existing dataset (except for VCSUM).
-
-- The tasks of [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) are built based on the original datasets and processed to make them suitable for long text evaluation. Specifically, for questions in the validation set, we select the evidence passage that contains the answer and several distracting articles. These articles together with the original question constitute the input of the related tasks.
-- The tasks of MultiFiedQA-zh and MultiFieldQA-en consist of long-text data from about 10 sources, including Latex papers, judicial documents, government work reports, and PDF documents indexed by Google. For each long text, we invite several PhD and master students to annotate, i.e., to ask questions based on the long text and give the correct answers. To better automate evaluation, we require the annotators to ask questions with definitive answers as much as possible.
-- The tasks of [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), and [QMSum](https://arxiv.org/pdf/2104.05938.pdf) directly use the data provided by the original papers. In the specific construction, we use the template provided by [ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/) to convert the corresponding data into pure text input.
-- The [VCSUM](https://arxiv.org/abs/2305.05280) task is built based on the original dataset, and we have designed a corresponding template to convert the corresponding data into pure text input.
-- The tasks of [TriviaQA](https://nlp.cs.washington.edu/triviaqa/) and [NQ](https://ai.google.com/research/NaturalQuestions/) are constructed in the manner of [CoLT5](https://arxiv.org/abs/2303.09752), which provides several examples of question and answering based on documents, and requires the language model to answer related questions based on new documents.
-- The tasks of [TREC](https://aclanthology.org/C02-1150.pdf) and [LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf) are built based on the original datasets. For each question in the validation set, we sample several data from the training set to form few-shot examples. These examples together with the questions in the validation set constitute the input for this task.
-- The PassageRetrieval-en task is constructed based on English Wikipedia. For each piece of data, we randomly sample 30 paragraphs from English Wikipedia and select one for summarization (using GPT3.5 Turbo). The task requires the model to specify which original paragraph the summary corresponds to.
-- The PassageCount task is constructed based on the English wiki. For each piece of data, we randomly sample several passages from English Wikipedia, repeat each paragraph at random several times, and finally shuffle the paragraphs at random. This task requires the model to determine how many paragraphs are not repeated in a given number of paragraphs.
-- The PasskeyRetrieval-zh task is constructed based on [C4](https://arxiv.org/abs/1910.10683). For each piece of data, we randomly sample several Chinese paragraphs from C4 and select one of them for summary (using GPT3.5 Turbo). This task requires the model to give the original paragraph to which the summary should correspond.
-- For the [LCC](https://arxiv.org/abs/2306.14893) task, we sample from the original code completion dataset. In the [RepoBench-P](https://arxiv.org/abs/2306.03091) task, we select the most challenging XF-F (Cross-File-First) setting from the original dataset and refer to the Oracle-Filled scenario in the original text. For each original piece of data, we randomly extract multiple cross-file code snippets, including the effective cross-file code snippet (gold snippet), and concatenate them as input, requiring the model to use the effective cross-file code to complete the current file's code.
\ No newline at end of file
diff --git a/task_zh.md b/task_zh.md
new file mode 100644
index 0000000..67a6a0d
--- /dev/null
+++ b/task_zh.md
@@ -0,0 +1,66 @@
+# 任务统计
+
+| 任务              |      任务类型  |      评价指标  |     平均长度                                       |语言 | Sample数量|
+| :--------- | :-----------:| :-----------: |:---------: | :-------------: |:---------: |
+| HotpotQA   | 多文档QA       | F1                        |9,149                           |英文                           |200                           |
+| 2WikiMultihopQA| 多文档QA | F1                        |4,885                           |英文                           |200                           |
+| Musique| 多文档QA   | F1                        |7,798                           |英文                           |200                           |
+| DuReader| 多文档QA  | Rouge-L                 |15,768                           |中文                           |200                           |
+| MultiFieldQA-en| 单文档QA | F1                        |4,559                           |英文                           |150                           |
+| MultiFieldQA-zh| 单文档QA | F1                        |6,771                           |中文                           |200                           |
+| NarrativeQA| 单文档QA | F1                        |18,405                           |英文                           |200                           |
+| Qasper| 单文档QA    | F1                        |3,619                           |英文                           |200                           |
+| GovReport| 摘要 | Rouge-L                 |8,169                           |英文                           |200                           |
+| QMSum| 摘要     | Rouge-L                 |10,546                           |英文                           |200                           |
+| VCSUM| 摘要     | Rouge-L                 |15,147                           |中文                           |200                           |
+| TriviaQA| Few shot  | F1                        |8,015                           |英文                           |200                           |
+| NQ| Few shot | F1                        |8,210                           |英文                           |200                           |
+| TREC| Few shot | Accuracy                |5,176                           |英文                           |200                           |
+| LSHT| Few shot | Accuracy                |22,333                           |中文                           |200                           |
+| PassageRetrieval-en| 合成任务 | Accuracy                |9,288                           |英文                           |200                           |
+| PassageCount| 合成任务 | Accuracy                |11,141                           |英文                           |200                           |
+| PassageRetrieval-zh | 合成任务 | Accuracy                |6,745                           |中文                           |200                           |
+| LCC| 代码 | Edit Sim              |1,235                           |Python/C#/Java                           |500                           |
+| RepoBench-P| 代码 | Edit Sim                |5,622                           |Python/Java                           |500                           |
+
+> 注:为了避免不同Tokenizer统计的差距,我们使用单词数(Python的split函数)来统计英文数据集和代码数据集的平均长度,使用汉字数来统计中文数据集的平均长度。
+
+# 任务说明
+
+| 任务              | 任务说明                                                     |
+| :----------------- | :----------------------------------------------------------- |
+| HotpotQA          | 基于多篇给定的文档,回答相关问题                             |
+| 2WikiMultihopQA   | 基于多篇给定的文档,回答相关问题                             |
+| Musique           | 基于多篇给定的文档,回答相关问题                             |
+| DuReader          | 基于多篇给定的检索文档,回答相关的中文问题                   |
+| MultiFieldQA-en   | 基于单篇文档,回答英文问题,文档所属的领域相对多元           |
+| MultiFieldQA-zh   | 基于单篇文档,回答中文问题,文档所属的领域相对多元           |
+| NarrativeQA       | 基于故事或剧本提问,包括对人物、情节、主题等重要元素的理解   |
+| Qasper            | 基于单篇论文提问,问题由NLP的读者提出,并由NLP从业者回答   |
+| GovReport         | 摘要任务,要求对政府的工作报告进行总结摘要                   |
+| QMSum             | 摘要任务,要求基于用户的查询对会议记录进行摘要               |
+| VCSUM             | 摘要任务,要求对中文会议记录进行总结摘要                     |
+| TriviaQA          | 单文档问答任务,提供若干的Few Shot样例                       |
+| NQ                | 单文档问答任务,提供若干的Few Shot样例                       |
+| TREC              | 分类任务,要求对问题进行分类,一共包含50个类别               |
+| LSHT              | 中文分类任务,要求对新闻进行分类,一共包含24个类别           |
+| PassageRetrieval-en | 给定30个英文维基的段落,判断给定的摘要属于哪个段落           |
+| PassageCount | 判断给定的若干的段落中不重复的段落一共有几个           |
+| PassageRetrieval-zh | 给定若干个出自C4数据集的中文段落,判断给定的摘要属于哪个段落 |
+| LCC               | 给定一段较长代码,要求预测出下一行代码                       |
+| RepoBench-P       | 给定一个github仓库内多个文件中的代码(包含文件间依赖),要求预测出下一行代码 |
+
+# 数据构造方式
+
+> 注:对于所有基于已有数据集构造的任务,我们均选用原有数据集的验证集或测试集的数据(VCSUM任务除外)
+
+- [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573)和[DuReader](https://github.com/baidu/DuReader)任务基于原始的数据集构建,并进行相关处理使其适用于长文本评测。具体地,对于验证集中的问题,我们会选取包含答案的evidence passage和若干干扰的文章,这些文章和原始的问题共同组成了相关任务的输入。
+- MultiFiedQA-zh和MultiFieldQA-en任务由约10种来源的长文本数据组成,包含Latex论文、裁判文书、政府工作报告和谷歌索引的PDF文档等。对于每篇长文本,我们邀请了若干博士生和硕士生来进行标注,即基于长文本提问,并给出正确的答案。为了更好地进行自动化评测,我们要求标注员尽可能提出有确定性答案的问题。
+- [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf)和[QMSum](https://arxiv.org/pdf/2104.05938.pdf)任务直接使用原论文提供的数据。在具体的构建中,我们使用[ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/)提供的模板来将对应的数据转换为纯文本的输入。
+- [VCSUM](https://arxiv.org/abs/2305.05280)任务基于原始的数据集构建,我们针对该数据设计了相应的模板将对应的数据转换为纯文本的输入。
+- [TriviaQA](https://nlp.cs.washington.edu/triviaqa/)和[NQ](https://ai.google.com/research/NaturalQuestions/)任务参考[CoLT5](https://arxiv.org/abs/2303.09752)的方式进行构建,即会提供若干基于文档进行问答的样例,并要求语言模型基于新的文档回答相关问题。
+- [TREC](https://aclanthology.org/C02-1150.pdf)和[LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf)任务基于原始的数据集构建。对于验证集中的每个问题,我们采样训练集中的若干数据组成Few-shot样例。这些样例会和验证集中的问题共同组成该任务的输入。
+- PassageRetrieval-en任务基于英文维基进行构造。对于每条数据,我们随机采样30段英文维基的段落,并选取其中一段进行摘要(使用GPT-3.5-Turbo)。该任务要求模型给出摘要应该对应哪个的原始段落。
+- PassageCount任务基于英文维基进行构造。对于每条数据,我们随机采样若干英文维基的段落,并将其中的每个段落随机重复若干次,最后将段落随机打乱。该任务要求模型判断给定的若干的段落中不重复的段落一共有几个。
+- PassageRetrieval-zh任务基于[C4](https://arxiv.org/abs/1910.10683)进行构造。对于每条数据,我们随机采样若干段来自于C4的中文段落,并选取其中一段进行摘要(使用GPT-3.5-Turbo)。该任务要求模型给出摘要应该对应哪个的原始段落。
+- [LCC](https://arxiv.org/abs/2306.14893)任务我们基于原始的代码补全数据集采样构建。[RepoBench-P](https://arxiv.org/abs/2306.03091)任务中我们选取了原数据集最具挑战性的XF-F(Cross-File-First)设定,并且参考原文中的Oracle-Filled场景,对于每一条原始数据我们随机抽取包括有效跨文件代码片段(gold snippet)在内的多个跨文件代码片段,将其拼接后作为输入,要求模型从其中利用有效的跨文件代码以补全当前文件中的代码。
\ No newline at end of file
-- 
GitLab