Refine README

Commit 47fed153 authored 1 year ago by Lv Xin
--- a/README.md
+++ b/README.md
@@ -5,13 +5,13 @@

 阅读[中文版本](README_ZH.md).

-# LongBench: A Multilingual, Multitask Benchmark for Long Context Understanding
+# LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

-**LongBench** is the first benchmark for multilingual, multitask, and comprehensive assessment of **long context understanding** capabilities of large language models. LongBench includes different languages (Chinese and English) to provide a more comprehensive evaluation of the large models' multilingual capabilities on long contexts. In addition, LongBench consists of 20 different tasks, covering key long context application scenarios such as single-document QA, multi-document QA, summarization, few-shot learning, code completion, and synthetic tasks.
+**LongBench** is the first benchmark for bilingual, multitask, and comprehensive assessment of the **long context understanding** capabilities of large language models. LongBench covers two languages (Chinese and English) to provide a more comprehensive evaluation of large models' multilingual capabilities on long contexts. In addition, LongBench comprises six major categories and twenty tasks, covering key long-text application scenarios such as multi-document QA, single-document QA, summarization, few-shot learning, code completion, and synthetic tasks.

 We are fully aware of the potentially high costs of model evaluation, especially in long context scenarios (such as manual annotation costs or API call costs). Therefore, we adopt a fully automated evaluation method that aims to measure a model's ability to understand long contexts at the lowest cost.

-LongBench includes 13 English tasks, 5 Chinese tasks, and 2 code tasks, with the average length of most tasks ranging from 5k to 15k. For the main task categories, LongBench includes six types of tasks, namely multi-document QA, single-document QA, summaries, Few-shot learning, synthetic tasks, and code completion. For detailed statistics and construction methods of LongBench tasks, please refer [here](task.md).
+LongBench includes 13 English tasks, 5 Chinese tasks, and 2 code tasks, with the average length of most tasks ranging from 5k to 15k. For detailed statistics and construction methods of LongBench tasks, please refer to [this page](task.md).

 | Task Type | \#English Task | \#Chinese Task | \#Code Task |
 | :-------: | :--------------------: | :--------------------: | :------------------: |
@@ -32,6 +32,8 @@ LongBench includes 13 English tasks, 5 Chinese tasks, and 2 code tasks, with the
 ## Leaderboard
 Here are the average scores (%) on the main task categories in both Chinese and English under the zero-shot scenario. Please refer to this [link](task.md) for the evaluation metrics used for each task.

+> Note: For text exceeding the processing length capability of the model, we truncate from the middle of the text, preserving information from the beginning and end, in accordance with the observations from [Lost in the Middle](https://arxiv.org/abs/2307.03172). Experiments show that this truncation method has the least impact on model performance.
+
 #### English
 |                   | Avg  | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Tasks |
 | ----------------- | :--: | :-----------: | :----------: | :-----------: | :---------------: | :-------------: | :-------------: |
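
As a reading aid for the truncation note added in this hunk, the following is a minimal Python sketch of middle truncation. It is an illustration under stated assumptions, not the repository's code: the function name `truncate_middle`, the `max_length` parameter, and the even split between head and tail are hypothetical, and the function works on an already tokenized sequence rather than on raw text.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def truncate_middle(tokens: Sequence[T], max_length: int) -> List[T]:
    """Drop tokens from the middle so that at most max_length remain.

    Hypothetical helper: keeps the first and last max_length // 2 tokens,
    which preserves the beginning and the end of the input.
    """
    if max_length < 2:
        raise ValueError("max_length must be at least 2")
    tokens = list(tokens)
    if len(tokens) <= max_length:
        # Short enough already: nothing to truncate.
        return tokens
    half = max_length // 2
    # Head and tail are kept; everything in between is discarded.
    return tokens[:half] + tokens[-half:]


if __name__ == "__main__":
    ids = list(range(10))
    print(truncate_middle(ids, 4))
```

With an odd `max_length`, this sketch keeps one token fewer than the limit, since both halves use integer division; a real pipeline might instead give the extra token to the head or the tail.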

--- a/README_ZH.md
+++ b/README_ZH.md
@@ -5,13 +5,13 @@

 Read this in [English](README.md).

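-# LongBench: A Multilingual, Multitask Benchmark for Long Text Understanding
+# LongBench: A Multitask, Chinese-English Bilingual Benchmark for Long Text Understanding

-**LongBench** is the first comprehensive dataset for multilingual, multitask, and thorough evaluation of the **long text understanding** capabilities of large language models. With the multilingual capabilities of large models attracting wide attention, LongBench covers different languages (Chinese and English) to evaluate large models' multilingual capabilities on long texts more comprehensively. In addition, LongBench consists of twenty different tasks, covering key long-text application scenarios such as single-document QA, multi-document QA, summarization, few-shot learning, code completion, and synthetic tasks.
+**LongBench** is the first multitask, Chinese-English bilingual benchmark for evaluating the **long text understanding capabilities** of large language models. With the multilingual capabilities of large models attracting wide attention, LongBench covers different languages (Chinese and English) to evaluate large models' multilingual capabilities on long texts more comprehensively. In addition, LongBench consists of six major categories and twenty different tasks, covering key long-text application scenarios such as single-document QA, multi-document QA, summarization, few-shot learning, code completion, and synthetic tasks.

 We are well aware of the high costs that model evaluation can incur, especially in long-text scenarios (such as manual annotation costs or API call costs). Therefore, we adopt a fully automated evaluation method that aims to measure and evaluate a model's long text understanding capability as effectively as possible at the lowest cost.

-LongBench includes 13 English tasks, 5 Chinese tasks, and 2 code tasks; the average length of most tasks is between 5k and 15k, and there are about 4,500 test instances in total. In terms of main task categories, LongBench includes six types of tasks: single-document QA, multi-document QA, summarization, few-shot learning, code completion, and synthetic tasks. For detailed statistics and construction methods of the LongBench dataset, please refer to [this page](task_zh.md).
+LongBench includes 13 English tasks, 5 Chinese tasks, and 2 code tasks; the average length of most tasks is between 5k and 15k, and there are about 4,500 test instances in total. For detailed statistics and construction methods of the LongBench dataset, please refer to [this page](task_zh.md).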

 |  Task Type   | \#English Tasks | \#Chinese Tasks | \#Code Tasks |
 | :----------: | :--------: | :--------: | :--------: |
@@ -31,6 +31,9 @@ LongBench包含13个英文任务、5个中文任务和2个代码任务，多数

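 ## Leaderboard
 Here we show the average scores (%) of all models on the main task categories in Chinese and English under the zero-shot scenario. Please refer to [this page](task_zh.md) for the evaluation metrics used for each task.
+
+> Note: For text exceeding the model's processing length, following the observations from [Lost in the Middle](https://arxiv.org/abs/2307.03172), we truncate from the middle of the text, preserving the information at the beginning and end. Experiments show that this truncation method has the least impact on model performance.
+
 #### English Leaderboard
 |     | Avg | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Tasks |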
 | --- | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
@@ -106,7 +109,7 @@ python eval.py

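 ## Detailed Results
 The tables below show the zero-shot evaluation results (%) of the models on all subtask datasets, where Chinese datasets are marked with "zh" (please refer to [this page](task_zh.md) for the evaluation metrics used for each task).
-> Note: For text exceeding the model's processing length, following the observations from [Lost in the Middle](https://arxiv.org/abs/2307.03172), we truncate from the middle of the text, preserving the information at the beginning and end. Experiments show that this truncation method has the least impact on model performance.
+
 #### Single-Doc QA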
 |                   | NarrativeQA | Qasper | MultiFieldQA-en | MultiFieldQA-zh |
 |-------------------|:-----------:|:------:|:---------------:|:---------------:|