From 62fa455dd58c3084c346324bc35f70a057fdacfc Mon Sep 17 00:00:00 2001 From: Lv Xin <lvxin14@163.com> Date: Sun, 30 Jul 2023 11:18:27 +0800 Subject: [PATCH] Change the default language of README to English --- README.md | 292 +++++++++++++++++++++++++++------------------------ README_EN.md | 187 --------------------------------- README_ZH.md | 175 ++++++++++++++++++++++++++++++ task.md | 119 ++++++++++----------- task_en.md | 67 ------------ task_zh.md | 66 ++++++++++++ 6 files changed, 453 insertions(+), 453 deletions(-) delete mode 100644 README_EN.md create mode 100644 README_ZH.md delete mode 100644 task_en.md create mode 100644 task_zh.md diff --git a/README.md b/README.md index d97741a..95519ba 100644 --- a/README.md +++ b/README.md @@ -3,62 +3,65 @@ 🤗 <a href="https://huggingface.co/THUDM/chatglm-6b" target="_blank">HF Repo</a> • 📃 Paper coming soon! </p> -Read this in [English](README_EN.md). - -# LongBench: 多è¯è¨€ã€å¤šä»»åŠ¡é•¿æ–‡æœ¬ç†è§£è¯„测基准 - -**LongBench**是第一个多è¯è¨€ã€å¤šä»»åŠ¡ã€å…¨é¢è¯„估大è¯è¨€æ¨¡åž‹**长文本ç†è§£**能力的综åˆæ•°æ®é›†ã€‚在目å‰å¤§æ¨¡åž‹å¤šè¯è¨€èƒ½åŠ›å¼•èµ·å¹¿æ³›å…³æ³¨çš„背景下,LongBench涵盖了ä¸åŒçš„è¯è¨€ï¼ˆä¸æ–‡å’Œè‹±æ–‡ï¼‰ï¼Œä»¥æ¤æ¥å¯¹å¤§æ¨¡åž‹åœ¨é•¿æ–‡æœ¬ä¸‹çš„多è¯è¨€èƒ½åŠ›è¿›è¡Œæ›´å…¨é¢çš„评估。åŒæ—¶ï¼ŒLongBench由二å个ä¸åŒçš„任务组æˆï¼Œè¦†ç›–了å•æ–‡æ¡£QAã€å¤šæ–‡æ¡£QAã€æ‘˜è¦ã€Few-shotå¦ä¹ ã€ä»£ç 补全和åˆæˆä»»åŠ¡ç‰å…³é”®çš„长文本应用场景。 - -我们深知模型评测过程ä¸å¯èƒ½äº§ç”Ÿçš„高昂æˆæœ¬ï¼Œå°¤å…¶æ˜¯é•¿æ–‡æœ¬åœºæ™¯ä¸‹ï¼ˆå¦‚äººå·¥æ ‡æ³¨æˆæœ¬æˆ–API调用æˆæœ¬ï¼‰ã€‚å› æ¤ï¼Œæˆ‘们采用了一ç§å…¨è‡ªåŠ¨çš„评测方å¼ï¼Œæ—¨åœ¨ä»¥æœ€ä½Žçš„æˆæœ¬ï¼Œæœ€æœ‰æ•ˆåœ°è¡¡é‡å’Œè¯„估模型的长文本ç†è§£èƒ½åŠ›ã€‚ - -LongBench包å«13个英文任务ã€5个ä¸æ–‡ä»»åŠ¡å’Œ2个代ç 任务,多数任务的平å‡é•¿åº¦åœ¨5k-15k之间,共包å«çº¦4500æ¡æµ‹è¯•æ•°æ®ã€‚从主è¦ä»»åŠ¡åˆ†ç±»ä¸Šï¼ŒLongBench包å«å…大类任务,å³å•æ–‡æ¡£QAã€å¤šæ–‡æ¡£QAã€æ‘˜è¦ã€Few-shotå¦ä¹ ã€ä»£ç 补全和åˆæˆä»»åŠ¡ã€‚关于LongBenchæ•°æ®é›†çš„具体统计åŠä»»åŠ¡æž„é€ æ–¹å¼è¯·å‚考[这里](task.md)。 - -| 任务类型 | 英文任务数 | ä¸æ–‡ä»»åŠ¡æ•° | 代ç 任务数 | -| :----------: | :--------: | :--------: | :--------: | -| å•æ–‡æ¡£QA | 3 | 1 | - | -| 多文档QA | 3 | 1 | - | -| æ‘˜è¦ | 2 | 1 | - | -| Few-shotå¦ä¹ | 3 | 1 | - | -| åˆæˆä»»åŠ¡ | 2 | 1 | - | -| 代ç 补全 | - | - | 2 | - -## 排行榜 -我们在这里展示了所有模型在Zero-shot场景下,在ä¸æ–‡å’Œè‹±æ–‡å„大类任务上得分的平å‡å€¼ï¼ˆ%),å„ä»»åŠ¡è¯„ä¼°æ‰€ç”¨æŒ‡æ ‡è¯·å‚考[这里](task.md)。 -#### è‹±æ–‡æ¦œå• -| | Avg | å•æ–‡æ¡£QA | 多文档QA | æ‘˜è¦ | Few-shotå¦ä¹ | 代ç 补全 | åˆæˆä»»åŠ¡ | -| --- | :-: | :-: | :-: | :-: | :-: | :-: | :-: | -| GPT-3.5-Turbo-16k | 45.5 | 39.8 | 38.7 | 26.5 | 76.0 | 54.5 | 37.8 | -| Llama2-7B-chat-4k | 29.0 | 24.8 | 21.4 | 23.9 | 50.5 | 47.3 | 5.9 | -| LongChat-7B-16k | 33.7 | 29.3 | 16.1 | 25.8 | 59.9 | 57.0 | 14.2 | -| XGen-7B-8k | 28.7 | 24.5 | 20.4 | 24.8 | 58.7 | 38.0 | 5.6 | -| InternLM-7B-8k | 24.7 | 17.1 | 20.8 | 13.3 | 52.7 | 39.7 | 4.7 | -| ChatGLM2-6B | 26.0 | 23.1 | 15.0 | 22.9 | 46.1 | 46.1 | 2.7 | -| ChatGLM2-6B-32k | 42.7 | 32.8 | 34.0 | 28.6 | 68.1 | 52.7 | 39.8 | - -#### ä¸æ–‡æ¦œå• -| | Avg | å•æ–‡æ¡£QA | 多文档QA | æ‘˜è¦ | Few-shotå¦ä¹ | 代ç 补全 | åˆæˆä»»åŠ¡ | -|-------|:---:|:-------------:|:------------:|:-------------:|:-----------------:|:---------------:|:----------------:| -| GPT-3.5-Turbo-16k | 44.5 | 61.2 | 28.7 | 16.0 | 29.2 | 54.5 | 77.5 | -| Llama2-7B-chat-4k | 13.5 | 11.6 | 1.9 | 0.2 | 19.8 | 47.3 | 0.5 | -| LongChat-7B-16k | 23.7 | 26.6 | 19.1 | 14.0 | 20.8 | 57.0 | 4.8 | -| XGen-7B-8k | 14.5 | 14.2 | 9.1 | 1.5 | 20.0 | 38.0 | 4.2 | -| InternLM-7B-8k | 18.6 | 33.3 | 8.9 | 13.0 | 15.5 | 39.7 | 0.9 | -| ChatGLM2-6B | 22.5 | 33.0 | 15.2 | 14.6 | 20.5 | 46.1 | 5.5 | -| ChatGLM2-6B-32k | 41.3 | 52.0 | 34.3 | 16.3 | 29.9 | 52.7 | 62.5 | - -#### 
长文本任务能力雷达图 +阅读 [ä¸æ–‡ç‰ˆæœ¬](README_ZH.md) + +# LongBench: A Multilingual, Multitask Benchmark Tailored for Long Context Understanding + +**LongBench** is the first comprehensive dataset for multi-language, multi-task, and comprehensive assessment of **long text understanding** capabilities of large language models. In the context of the widespread attention to the multi-language capabilities of large models, LongBench includes different languages (Chinese and English) to provide a more comprehensive evaluation of the large models' multi-language capabilities in long texts. In addition, LongBench consists of twenty different tasks, covering key long-text application scenarios such as single-document QA, multi-document QA, summaries, few-shot learning, code completion, and synthesis tasks. + +We are fully aware of the potentially high costs involved in the model evaluation process, especially in the context of long-text scenarios (such as manual annotation costs or API call costs). Therefore, we have adopted a fully automated evaluation method, aimed at measuring and evaluating the model's ability to understand long texts at the lowest cost and most effectively. + +LongBench includes 13 English tasks, 5 Chinese tasks, and 2 code tasks, with the average length of most tasks ranging from 5k to 15k. From the main task categories, LongBench includes six types of tasks, namely multi-document QA, single-document QA, summaries, Few-shot learning, synthetic tasks, and code completion. For detailed statistics and construction methods of LongBench tasks, please refer [here](task.md). + +| Task Type | \#English Task | \#Chinese Task | \#Code Task | +| :-------: | :--------------------: | :--------------------: | :------------------: | +| Multi-document QA | 3 | 1 | - | +| Single-document QA | 3 | 1 | - | +| Summarization | 2 | 1 | - | +| Few-shot learning | 3 | 1 | - | +| Synthetic Tasks | 2 | 1 | - | +| Code Completion | - | - | 2 | + +## Leaderboard +Here is the average score (%) of all models on various major tasks in both Chinese and English languages under the Zero-shot scenario. Please refer to this [link](task.md) for the evaluation metrics used for each task. 
+ +#### English +| | Avg | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Tasks | +| ----------------- | :--: | :-----------: | :----------: | :-----------: | :---------------: | :-------------: | :-------------: | +| GPT-3.5-Turbo-16k | 45.5 | 39.8 | 38.7 | 26.5 | 76.0 | 54.5 | 37.8 | +| Llama2-7B-chat-4k | 29.0 | 24.8 | 21.4 | 23.9 | 50.5 | 47.3 | 5.9 | +| LongChat-7B-16k | 33.7 | 29.3 | 16.1 | 25.8 | 59.9 | 57.0 | 14.2 | +| XGen-7B-8k | 28.7 | 24.5 | 20.4 | 24.8 | 58.7 | 38.0 | 5.6 | +| InternLM-7B-8k | 24.7 | 17.1 | 20.8 | 13.3 | 52.7 | 39.7 | 4.7 | +| ChatGLM2-6B | 26.0 | 23.1 | 15.0 | 22.9 | 46.1 | 46.1 | 2.7 | +| ChatGLM2-6B-32k | 42.7 | 32.8 | 34.0 | 28.6 | 68.1 | 52.7 | 39.8 | + +#### Chinese + +| | Avg | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Tasks | +| ----------------- | :--: | :-----------: | :----------: | :-----------: | :---------------: | :-------------: | :-------------: | +| GPT-3.5-Turbo-16k | 44.5 | 61.2 | 28.7 | 16.0 | 29.2 | 54.5 | 77.5 | +| Llama2-7B-chat-4k | 13.5 | 11.6 | 1.9 | 0.2 | 19.8 | 47.3 | 0.5 | +| LongChat-7B-16k | 23.7 | 26.6 | 19.1 | 14.0 | 20.8 | 57.0 | 4.8 | +| XGen-7B-8k | 14.5 | 14.2 | 9.1 | 1.5 | 20.0 | 38.0 | 4.2 | +| InternLM-7B-8k | 18.6 | 33.3 | 8.9 | 13.0 | 15.5 | 39.7 | 0.9 | +| ChatGLM2-6B | 22.5 | 33.0 | 15.2 | 14.6 | 20.5 | 46.1 | 5.5 | +| ChatGLM2-6B-32k | 41.3 | 52.0 | 34.3 | 16.3 | 29.9 | 52.7 | 62.5 | + +#### Radar Chart of Long Text Task Capability +  -#### ä¸åŒé•¿åº¦æ–‡æœ¬ä¸‹çš„能力å˜åŒ– -为了更有针对性地分æžæ¨¡åž‹åœ¨ä¸åŒæ–‡æœ¬é•¿åº¦ä¸‹çš„相对表现,下图展示了模型在ä¸åŒæ–‡æœ¬é•¿åº¦åŒºé—´ä¸Šï¼Œæ‰€æœ‰ä»»åŠ¡ä¸Šçš„å¹³å‡ç›¸å¯¹åˆ†æ•°ã€‚ +#### Variation of Abilities under Different Text Lengths +To more specifically analyze the model's relative performance under different text lengths, the following chart shows the average relative scores on all tasks over different text length intervals.  -> 注:å‡è®¾æ¨¡åž‹åœ¨æŸä¸ªä»»åŠ¡çš„特定长度范围内数æ®ä¸Šå¾—分为x,在该任务所有数æ®ä¸Šå¾—分为y,则模型在该长度范围的**相对分数**为(x/y-1)。为了更好比较ä¸åŒæ¨¡åž‹çš„å˜åŒ–趋势,我们在0-4k将所有折线平移至0。 +> Note: Assume that the model scores x on the data within a specific length range of a task, and y on all data of that task, then the model's **relative score** for that length range is (x/y-1). To better compare the trends of different models, we shift all the lines to 0 from 0-4k. -## 如何在LongBench上评测模型 +## How to evaluate models on LongBench -#### è½½å…¥æ•°æ® -ä½ å¯ä»¥é€šè¿‡Hugging Face datasetsæ¥ä¸‹è½½å¹¶è½½å…¥**LongBench**çš„æ•°æ®ï¼ˆ[🤗 HF Repo](https://huggingface.co/datasets/THUDM/LongBench)): +#### Loading Data +You can download and load the **LongBench** data through the Hugging Face datasets ([🤗 HF Repo](https://huggingface.co/datasets/THUDM/LongBench)): ```python from datasets import load_dataset @@ -69,107 +72,116 @@ datasets = ["hotpotqa", "2wikimqa", "musique", "dureader", "narrativeqa", "qaspe for dataset in datasets: data = load_dataset('THUDM/LongBench', dataset, split='test') ``` -åŒæ ·åœ°ï¼Œä½ 也å¯ä»¥ç›´æŽ¥ç”¨è¿™ä¸ª[链接](https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip)下载所有的评测数æ®ã€‚ +Alternatively, you can download the folder from [this link](https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip) to load the data. 
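As a quick sanity check after loading, you can inspect a single sample. The short sketch below is only illustrative; the subset name comes from the list above and the field names follow the unified format described in the next section:

```python
from datasets import load_dataset

# Load one subset and look at a single sample.
data = load_dataset('THUDM/LongBench', 'hotpotqa', split='test')
sample = data[0]

print(sample["dataset"], sample["language"], sample["length"])
print(sample["input"])           # the short instruction / question
print(sample["context"][:500])   # beginning of the long context
print(sample["answers"])         # list of gold answers
```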
+
+#### Data Format
+
+All data in **LongBench** are standardized to the following format:
 
-#### 数据格式
-**LongBench**中所有数据都统一为以下格式:
 ```json
 {
-    "input": "任务的输入/指令,通常较短,比如QA中的问题、Few-shot任务中的提问等",
-    "context": "任务所需的长语境文本,比如文档、跨文件代码、Few-shot任务中的few-shot样本",
-    "answers": "由所有标准答案组成的列表",
-    "length": "前三项文本的总长度(中、英文分别用字、词数统计)",
-    "dataset": "本条数据所属数据集名称",
-    "language": "本条数据的语言",
-    "all_classes": "分类任务中的所有类别,非分类任务则为null",
-    "_id": "每条数据的随机id"
+    "input": "The input/instruction for the task, usually short, such as the question in QA or the query in few-shot tasks",
+    "context": "The long context text required by the task, such as documents, cross-file code, or the few-shot examples in few-shot tasks",
+    "answers": "A list of all standard answers",
+    "length": "The total length of the first three fields (counted in characters for Chinese and in words for English)",
+    "dataset": "The name of the dataset this sample belongs to",
+    "language": "The language of this sample",
+    "all_classes": "All categories in classification tasks, null for non-classification tasks",
+    "_id": "A random id for each sample"
 }
 ```
 
-#### 评测
-我们以ChatGLM2-6B为例提供了一份评测代码。首先,运行仓库下的[pred.py](pred.py)
+#### Evaluation
+We provide evaluation code that uses ChatGLM2-6B as an example. First, run [pred.py](pred.py) in the repository:
 ```bash
 CUDA_VISIBLE_DEVICES=0 python pred.py
 ```
-可以在`pred/`文件夹下得到模型在所有数据集下的输出,此后运行[eval.py](eval.py)的评测代码:
+This produces the model outputs on all datasets in the `pred/` folder. After that, run the evaluation code in [eval.py](eval.py):
 ```bash
 python eval.py
 ```
-可以在`result.json`中得到在各数据集上的评测结果。请注意,我们在`config/`下提供了我们总结出来的在各数据集上适合的输入格式和最大输出长度限制,在评测的时候可以进行修改以更好地适用你要评测的模型,修改后在[pred.py](pred.py)评测时会自动按照新的格式去整理数据并得到对应的模型输出。
+This writes the evaluation results on all datasets to `result.json`. Note that `config/` contains the input format and the maximum output length limit we recommend for each dataset; you can modify them to better suit the model you want to evaluate, and [pred.py](pred.py) will then automatically organize the data according to the new format and produce the corresponding model output.
+
+## Evaluation Results on Each Dataset
+
+The following tables show the Zero-shot evaluation results (%) of the models on all subtask datasets, where Chinese datasets are denoted by "zh" (please refer to this [link](task.md) for the evaluation metrics used for each task).
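For intuition about these scores, a word-overlap F1 of the kind commonly used for the QA-style subtasks can be sketched as follows. This is only an illustration (the function name is ours), and it is not guaranteed to match the exact scoring, such as answer normalization, implemented in [eval.py](eval.py):

```python
from collections import Counter

def qa_f1(prediction: str, reference: str) -> float:
    """Word-level F1 between a model prediction and a single gold answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# With several gold answers per sample (the "answers" list), the sample score
# is typically taken as the maximum F1 over all gold answers.
```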
-## 详细评测结果 -下é¢çš„å‡ å¼ è¡¨æ ¼å±•ç¤ºäº†æ¨¡åž‹åœ¨æ‰€æœ‰å任务数æ®é›†ä¸Šçš„Zero-shot评测结果(%),其ä¸çš„ä¸æ–‡æ•°æ®é›†ä»¥â€œzhâ€æ ‡ç¤ºï¼ˆå„ä»»åŠ¡è¯„ä¼°æ‰€ç”¨æŒ‡æ ‡è¯·å‚考[这里](task.md))。 -> 注:对于超出模型处ç†é•¿åº¦èƒ½åŠ›çš„文本,å‚考[Lost in the Middle](https://arxiv.org/abs/2307.03172)的观察,我们从文本ä¸é—´è¿›è¡Œæˆªæ–,ä¿æŒå‰åŽéƒ¨åˆ†çš„ä¿¡æ¯ã€‚实验表明,这ç§æˆªæ–æ–¹å¼å¯¹æ¨¡åž‹æ€§èƒ½å½±å“最å°ã€‚ -#### å•æ–‡æ¡£QA +#### Single-Document QA | | NarrativeQA | Qasper | MultiFieldQA-en | MultiFieldQA-zh | -|-------------------|:-----------:|:------:|:---------------:|:---------------:| -| GPT-3.5-Turbo-16k | 23.6 | 43.3 | 52.3 | 61.2 | -| Llama2-7B-chat-4k | 19.1 | 19.6 | 35.8 | 11.6 | -| LongChat-7B-16k | 21.6 | 21.6 | 44.6 | 26.6 | -| XGen-7B-8k | 17.9 | 18.3 | 37.2 | 14.2 | -| InternLM-7B-8k | 12.4 | 16.8 | 22.3 | 33.3 | -| ChatGLM2-6B | 11.2 | 23.7 | 34.2 | 33.0 | -| ChatGLM2-6B-32k | 20.4 | 32.2 | 45.7 | 52.0 | - -#### 多文档QA -| | HotpotQA | 2WikiMQA | Musique | DuReader (zh) | -|----------------------|:--------:|:--------:|:-------:|:--------:| -| GPT-3.5-Turbo-16k | 51.6 | 37.7 | 26.9 | 28.7 | -| Llama2-7B-chat-4k | 24.3 | 31.4 | 8.6 | 1.9 | -| LongChat-7B-16k | 22.4 | 16.8 | 9.1 | 19.1 | -| XGen-7B-8k | 28.3 | 21.5 | 11.5 | 9.1 | -| InternLM-7B-8k | 27.9 | 24.0 | 10.3 | 8.9 | -| ChatGLM2-6B | 20.2 | 19.6 | 5.3 | 15.2 | -| ChatGLM2-6B-32k | 44.9 | 34.9 | 22.2 | 34.3 | - -#### æ‘˜è¦ -| | GovReport | QMSum | VCSUM (zh) | -|:-----------|:---------:|:-----:|:-----:| -| GPT-3.5-Turbo-16k | 29.5 | 23.4 | 16.0 | -| Llama2-7B-chat-4k | 27.3 | 20.6 | 0.2 | -| LongChat-7B-16k | 28.4 | 23.2 | 14.0 | -| XGen-7B-8k | 27.8 | 21.7 | 1.5 | -| InternLM-7B-8k | 9.8 | 16.8 | 13.0 | -| ChatGLM2-6B | 23.7 | 22.2 | 14.6 | -| ChatGLM2-6B-32k | 33.3 | 23.9 | 16.3 | - -#### Few-shotå¦ä¹ -| | TREC | NQ | TriviaQA | LSHT (zh) | -| --- | :-: | :-: | :-: | :-: | -| GPT-3.5-Turbo-16k | 68.0 | 73.0 | 87.1 | 29.2 | -| Llama2-7B-chat-4k | 60.5 | 31.4 | 59.7 | 19.8 | -| LongChat-7B-16k | 61.5 | 44.8 | 73.5 | 20.8 | -| XGen-7B-8k | 66.0 | 43.2 | 67.0 | 20.0 | -| InternLM-7B-8k | 49.0 | 47.6 | 61.6 | 15.5 | -| ChatGLM2-6B | 44.0 | 34.5 | 59.8 | 20.5 | -| ChatGLM2-6B-32k | 62.0 | 64.9 | 77.6 | 29.9 | - -#### 代ç 补全 -| | LCC | RepoBench-P | -| --- | :-: | :-: | -| GPT-3.5-Turbo-16k | 54.7 | 54.3 | -| Llama2-7B-chat-4k | 52.3 | 42.4 | -| LongChat-7B-16k | 59.2 | 54.7 | -| XGen-7B-8k | 38.8 | 37.3 | -| InternLM-7B-8k | 45.5 | 34.0 | -| ChatGLM2-6B | 48.4 | 43.7 | -| ChatGLM2-6B-32k | 55.4 | 50.0 | - -#### åˆæˆä»»åŠ¡ -| | PassageRetrieval-en | Passage Count | PassageRetrieval-zh | -| --- | :-: | :-: | :-: | -| GPT-3.5-Turbo-16k | 71.0 | 4.5 | 77.5 | -| Llama2-7B-chat-4k | 9.2 | 2.5 | 0.5 | -| LongChat-7B-16k | 24.0 | 4.5 | 4.8 | -| XGen-7B-8k | 9.0 | 2.2 | 4.2 | -| InternLM-7B-8k | 6.5 | 2.9 | 0.9 | -| ChatGLM2-6B | 3.2 | 2.1 | 5.5 | -| ChatGLM2-6B-32k | 77.5 | 2.0 | 62.5 | - -## 致谢 -- **LongBench**的部分任务基于之å‰çš„ç 
”ç©¶è€…æ出的数æ®é›†æž„建,包括[HotpotQA](https://hotpotqa.github.io/),[2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/),[Musique](https://arxiv.org/abs/2108.00573),[DuReader](https://github.com/baidu/DuReader),[NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf),[Qasper](https://arxiv.org/pdf/2105.03011.pdf),[GovReport](https://arxiv.org/pdf/2104.02112.pdf),[QMSum](https://arxiv.org/pdf/2104.05938.pdf),[VCSUM](https://arxiv.org/abs/2305.05280),[TriviaQA](https://nlp.cs.washington.edu/triviaqa/),[NQ](https://ai.google.com/research/NaturalQuestions/),[TREC](https://aclanthology.org/C02-1150.pdf),[LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf),[LCC](https://arxiv.org/abs/2306.14893)å’Œ[RepoBench-P](https://arxiv.org/abs/2306.03091)。 - -## 引用 -本工作由**THUKEG**å’Œ**Zhipu AI**å…±åŒå®Œæˆï¼Œç›¸å…³è®ºæ–‡æ£åœ¨æ’°å†™ä¸ï¼Œå±Šæ—¶å°†æ›´æ–°å¼•ç”¨ä¿¡æ¯ï¼Œæ•¬è¯·å…³æ³¨~ - -如果您使用该Benchmark,å¯ä»¥ä¸€å¹¶å¼•ç”¨LongBench所基于的数æ®é›†å¯¹åº”的论文,相关引用信æ¯åˆ—举在[这里](refs/ref.bib)。 +| ----------------- | :---------: | :----: | :-------------: | :-------------: | +| GPT-3.5-Turbo-16k | 23.6 | 43.3 | 52.3 | 61.2 | +| Llama2-7B-chat-4k | 19.1 | 19.6 | 35.8 | 11.6 | +| LongChat-7B-16k | 21.6 | 21.6 | 44.6 | 26.6 | +| XGen-7B-8k | 17.9 | 18.3 | 37.2 | 14.2 | +| InternLM-7B-8k | 12.4 | 16.8 | 22.3 | 33.3 | +| ChatGLM2-6B | 11.2 | 23.7 | 34.2 | 33.0 | +| ChatGLM2-6B-32k | 20.4 | 32.2 | 45.7 | 52.0 | + +#### Multi-Document QA + +| | HotpotQA | 2WikiMQA | Musique | DuReader (zh) | +| ----------------- | :------: | :------: | :-----: | :-----------: | +| GPT-3.5-Turbo-16k | 51.6 | 37.7 | 26.9 | 28.7 | +| Llama2-7B-chat-4k | 24.3 | 31.4 | 8.6 | 1.9 | +| LongChat-7B-16k | 22.4 | 16.8 | 9.1 | 19.1 | +| XGen-7B-8k | 28.3 | 21.5 | 11.5 | 9.1 | +| InternLM-7B-8k | 27.9 | 24.0 | 10.3 | 8.9 | +| ChatGLM2-6B | 20.2 | 19.6 | 5.3 | 15.2 | +| ChatGLM2-6B-32k | 44.9 | 34.9 | 22.2 | 34.3 | + +#### Summarization + +| | GovReport | QMSum | VCSUM (zh) | +| :---------------- | :-------: | :---: | :--------: | +| GPT-3.5-Turbo-16k | 29.5 | 23.4 | 16.0 | +| Llama2-7B-chat-4k | 27.3 | 20.6 | 0.2 | +| LongChat-7B-16k | 28.4 | 23.2 | 14.0 | +| XGen-7B-8k | 27.8 | 21.7 | 1.5 | +| InternLM-7B-8k | 9.8 | 16.8 | 13.0 | +| ChatGLM2-6B | 23.7 | 22.2 | 14.6 | +| ChatGLM2-6B-32k | 33.3 | 23.9 | 16.3 | + +#### Few-shot Learning + +| | TREC | NQ | TriviaQA | LSHT (zh) | +| ----------------- | :--: | :--: | :------: | :-------: | +| GPT-3.5-Turbo-16k | 68.0 | 73.0 | 87.1 | 29.2 | +| Llama2-7B-chat-4k | 60.5 | 31.4 | 59.7 | 19.8 | +| LongChat-7B-16k | 61.5 | 44.8 | 73.5 | 20.8 | +| XGen-7B-8k | 66.0 | 43.2 | 67.0 | 20.0 | +| InternLM-7B-8k | 49.0 | 47.6 | 61.6 | 15.5 | +| ChatGLM2-6B | 44.0 | 34.5 | 59.8 | 20.5 | +| ChatGLM2-6B-32k | 62.0 | 64.9 | 77.6 | 29.9 | + +#### Code Completion + +| | LCC | RepoBench-P | +| ----------------- | :--: | :---------: | +| GPT-3.5-Turbo-16k | 54.7 | 54.3 | +| Llama2-7B-chat-4k | 52.3 | 42.4 | +| LongChat-7B-16k | 59.2 | 54.7 | +| XGen-7B-8k | 38.8 | 37.3 | +| InternLM-7B-8k | 45.5 | 34.0 | +| ChatGLM2-6B | 48.4 | 43.7 | +| ChatGLM2-6B-32k | 55.4 | 50.0 | + +#### Synthetic Tasks + +| | PassageRetrieval-en | Passage Count | PassageRetrieval-zh | +| ----------------- | :-----------------: | :-----------: | :-----------------: | +| GPT-3.5-Turbo-16k | 71.0 | 4.5 | 77.5 | +| Llama2-7B-chat-4k | 9.2 | 2.5 | 0.5 | +| LongChat-7B-16k | 24.0 | 4.5 | 4.8 | +| XGen-7B-8k | 9.0 | 2.2 | 4.2 | +| InternLM-7B-8k | 6.5 | 2.9 | 0.9 | +| ChatGLM2-6B | 3.2 | 2.1 | 5.5 | +| ChatGLM2-6B-32k | 77.5 | 2.0 | 62.5 | + +## 
Acknowledgements + +- Some of the tasks of **LongBench** are based on the datasets proposed by previous researchers, including [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), [DuReader](https://github.com/baidu/DuReader), [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), [QMSum](https://arxiv.org/pdf/2104.05938.pdf), [VCSUM](https://arxiv.org/abs/2305.05280), [TriviaQA](https://nlp.cs.washington.edu/triviaqa/), [NQ](https://ai.google.com/research/NaturalQuestions/), [TREC](https://aclanthology.org/C02-1150.pdf), [LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf), [LCC](https://arxiv.org/abs/2306.14893) and [RepoBench-P](https://arxiv.org/abs/2306.03091). + +## Citation +This work is jointly completed by **THUKEG** and **Zhipu AI**. The related paper is currently being written, and the citation information will be updated when it's ready. Please stay tuned~ + +If you use this benchmark, you can also cite the papers corresponding to the datasets that LongBench is based on. The relevant citation information is listed [here](refs/ref.bib). \ No newline at end of file diff --git a/README_EN.md b/README_EN.md deleted file mode 100644 index 629c756..0000000 --- a/README_EN.md +++ /dev/null @@ -1,187 +0,0 @@ - -<p align="center"> - 🤗 <a href="https://huggingface.co/THUDM/chatglm-6b" target="_blank">HF Repo</a> • 📃 Paper coming soon! -</p> - -阅读 [ä¸æ–‡ç‰ˆæœ¬](README.md) - -# LongBench: A Multilingual, Multitask Benchmark Tailored for Long Context Understanding - -**LongBench** is the first comprehensive dataset for multi-language, multi-task, and comprehensive assessment of **long text understanding** capabilities of large language models. In the context of the widespread attention to the multi-language capabilities of large models, LongBench includes different languages (Chinese and English) to provide a more comprehensive evaluation of the large models' multi-language capabilities in long texts. In addition, LongBench consists of twenty different tasks, covering key long-text application scenarios such as single-document QA, multi-document QA, summaries, few-shot learning, code completion, and synthesis tasks. - -We are fully aware of the potentially high costs involved in the model evaluation process, especially in the context of long-text scenarios (such as manual annotation costs or API call costs). Therefore, we have adopted a fully automated evaluation method, aimed at measuring and evaluating the model's ability to understand long texts at the lowest cost and most effectively. - -LongBench includes 13 English tasks, 5 Chinese tasks, and 2 code tasks, with the average length of most tasks ranging from 5k to 15k. From the main task categories, LongBench includes six types of tasks, namely multi-document QA, single-document QA, summaries, Few-shot learning, synthetic tasks, and code completion. For detailed statistics and construction methods of LongBench tasks, please refer [here](task_en.md). 
- -| Task Type | \#English Task | \#Chinese Task | \#Code Task | -| :-------: | :--------------------: | :--------------------: | :------------------: | -| Multi-document QA | 3 | 1 | - | -| Single-document QA | 3 | 1 | - | -| Summarization | 2 | 1 | - | -| Few-shot learning | 3 | 1 | - | -| Synthetic Tasks | 2 | 1 | - | -| Code Completion | - | - | 2 | - -## Leaderboard -Here is the average score (%) of all models on various major tasks in both Chinese and English languages under the Zero-shot scenario. Please refer to this [link](task_en.md) for the evaluation metrics used for each task. - -#### English -| | Avg | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Tasks | -| ----------------- | :--: | :-----------: | :----------: | :-----------: | :---------------: | :-------------: | :-------------: | -| GPT-3.5-Turbo-16k | 45.5 | 39.8 | 38.7 | 26.5 | 76.0 | 54.5 | 37.8 | -| Llama2-7B-chat-4k | 29.0 | 24.8 | 21.4 | 23.9 | 50.5 | 47.3 | 5.9 | -| LongChat-7B-16k | 33.7 | 29.3 | 16.1 | 25.8 | 59.9 | 57.0 | 14.2 | -| XGen-7B-8k | 28.7 | 24.5 | 20.4 | 24.8 | 58.7 | 38.0 | 5.6 | -| InternLM-7B-8k | 24.7 | 17.1 | 20.8 | 13.3 | 52.7 | 39.7 | 4.7 | -| ChatGLM2-6B | 26.0 | 23.1 | 15.0 | 22.9 | 46.1 | 46.1 | 2.7 | -| ChatGLM2-6B-32k | 42.7 | 32.8 | 34.0 | 28.6 | 68.1 | 52.7 | 39.8 | - -#### Chinese - -| | Avg | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Tasks | -| ----------------- | :--: | :-----------: | :----------: | :-----------: | :---------------: | :-------------: | :-------------: | -| GPT-3.5-Turbo-16k | 44.5 | 61.2 | 28.7 | 16.0 | 29.2 | 54.5 | 77.5 | -| Llama2-7B-chat-4k | 13.5 | 11.6 | 1.9 | 0.2 | 19.8 | 47.3 | 0.5 | -| LongChat-7B-16k | 23.7 | 26.6 | 19.1 | 14.0 | 20.8 | 57.0 | 4.8 | -| XGen-7B-8k | 14.5 | 14.2 | 9.1 | 1.5 | 20.0 | 38.0 | 4.2 | -| InternLM-7B-8k | 18.6 | 33.3 | 8.9 | 13.0 | 15.5 | 39.7 | 0.9 | -| ChatGLM2-6B | 22.5 | 33.0 | 15.2 | 14.6 | 20.5 | 46.1 | 5.5 | -| ChatGLM2-6B-32k | 41.3 | 52.0 | 34.3 | 16.3 | 29.9 | 52.7 | 62.5 | - -#### Radar Chart of Long Text Task Capability - - - -#### Variation of Abilities under Different Text Lengths -To more specifically analyze the model's relative performance under different text lengths, the following chart shows the average relative scores on all tasks over different text length intervals. - - -> Note: Assume that the model scores x on the data within a specific length range of a task, and y on all data of that task, then the model's **relative score** for that length range is (x/y-1). To better compare the trends of different models, we shift all the lines to 0 from 0-4k. - -## How to evaluate models on LongBench - -#### Loading Data -You can download and load the **LongBench** data through the Hugging Face datasets ([🤗 HF Repo](https://huggingface.co/datasets/THUDM/LongBench)): -```python -from datasets import load_dataset - -datasets = ["hotpotqa", "2wikimqa", "musique", "dureader", "narrativeqa", "qasper", "multifieldqa_en", \ - "multifieldqa_zh", "gov_report", "qmsum", "vcsum", "trec", "nq", "triviaqa", "lsht", "passage_count", \ - "passage_retrieval_en", "passage_retrieval_zh", "lcc", "repobench-p"] - -for dataset in datasets: - data = load_dataset('THUDM/LongBench', dataset, split='test') -``` -Alternatively, you can download the folder from [this link](https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip) to load the data. 
- -#### Data Format - -All data in **LongBench** are standardized to the following format: - -```json -{ - "input": "The input/command for the task, usually short, such as questions in QA, queries in Few-shot tasks, etc.", - "context": "The long context text required for the task, such as documents, cross-file code, few-shot samples in Few-shot tasks", - "answers": "List composed of all standard answers", - "length": "Total length of the first three items of text (counted in characters for Chinese and words for English)", - "dataset": "The name of the dataset to which this piece of data belongs", - "language": "The language of this piece of data", - "all_classes": "All categories in classification tasks, null for non-classification tasks", - "_id": "Random id for each piece of data" -} -``` - -#### Evaluation -We provide an evaluation code using ChatGLM2-6B as an example. Firstly, run the [pred.py](pred.py) under the repository: -```bash -CUDA_VISIBLE_DEVICES=0 python pred.py -``` -You can get the model outputs on all datasets in the `pred/` folder. After that, run the evaluation code of [eval.py](eval.py): -```bash -python eval.py -``` -You can get the evaluation results on various datasets in `result.json`. Please note that we provide the input format suitable for each dataset and the maximum output length limit we summarized in `config/`. You can modify them during the evaluation to better suit the model you want to evaluate. After modification, when evaluating with [pred.py](pred.py), the data will be automatically organized according to the new format to get the corresponding model output. - -## Evaluation Result on Each Dataset - -The following tables display the Zero-shot evaluation results (%) of the model on all subtask datasets, where Chinese datasets are denoted by "zh" (please refer to this [link](task_en.md) for the evaluation metrics used for each task). 
- -#### Single-Document QA -| | NarrativeQA | Qasper | MultiFieldQA-en | MultiFieldQA-zh | -| ----------------- | :---------: | :----: | :-------------: | :-------------: | -| GPT-3.5-Turbo-16k | 23.6 | 43.3 | 52.3 | 61.2 | -| Llama2-7B-chat-4k | 19.1 | 19.6 | 35.8 | 11.6 | -| LongChat-7B-16k | 21.6 | 21.6 | 44.6 | 26.6 | -| XGen-7B-8k | 17.9 | 18.3 | 37.2 | 14.2 | -| InternLM-7B-8k | 12.4 | 16.8 | 22.3 | 33.3 | -| ChatGLM2-6B | 11.2 | 23.7 | 34.2 | 33.0 | -| ChatGLM2-6B-32k | 20.4 | 32.2 | 45.7 | 52.0 | - -#### Multi-Document QA - -| | HotpotQA | 2WikiMQA | Musique | DuReader (zh) | -| ----------------- | :------: | :------: | :-----: | :-----------: | -| GPT-3.5-Turbo-16k | 51.6 | 37.7 | 26.9 | 28.7 | -| Llama2-7B-chat-4k | 24.3 | 31.4 | 8.6 | 1.9 | -| LongChat-7B-16k | 22.4 | 16.8 | 9.1 | 19.1 | -| XGen-7B-8k | 28.3 | 21.5 | 11.5 | 9.1 | -| InternLM-7B-8k | 27.9 | 24.0 | 10.3 | 8.9 | -| ChatGLM2-6B | 20.2 | 19.6 | 5.3 | 15.2 | -| ChatGLM2-6B-32k | 44.9 | 34.9 | 22.2 | 34.3 | - -#### Summarization - -| | GovReport | QMSum | VCSUM (zh) | -| :---------------- | :-------: | :---: | :--------: | -| GPT-3.5-Turbo-16k | 29.5 | 23.4 | 16.0 | -| Llama2-7B-chat-4k | 27.3 | 20.6 | 0.2 | -| LongChat-7B-16k | 28.4 | 23.2 | 14.0 | -| XGen-7B-8k | 27.8 | 21.7 | 1.5 | -| InternLM-7B-8k | 9.8 | 16.8 | 13.0 | -| ChatGLM2-6B | 23.7 | 22.2 | 14.6 | -| ChatGLM2-6B-32k | 33.3 | 23.9 | 16.3 | - -#### Few-shot Learning - -| | TREC | NQ | TriviaQA | LSHT (zh) | -| ----------------- | :--: | :--: | :------: | :-------: | -| GPT-3.5-Turbo-16k | 68.0 | 73.0 | 87.1 | 29.2 | -| Llama2-7B-chat-4k | 60.5 | 31.4 | 59.7 | 19.8 | -| LongChat-7B-16k | 61.5 | 44.8 | 73.5 | 20.8 | -| XGen-7B-8k | 66.0 | 43.2 | 67.0 | 20.0 | -| InternLM-7B-8k | 49.0 | 47.6 | 61.6 | 15.5 | -| ChatGLM2-6B | 44.0 | 34.5 | 59.8 | 20.5 | -| ChatGLM2-6B-32k | 62.0 | 64.9 | 77.6 | 29.9 | - -#### Code Completion - -| | LCC | RepoBench-P | -| ----------------- | :--: | :---------: | -| GPT-3.5-Turbo-16k | 54.7 | 54.3 | -| Llama2-7B-chat-4k | 52.3 | 42.4 | -| LongChat-7B-16k | 59.2 | 54.7 | -| XGen-7B-8k | 38.8 | 37.3 | -| InternLM-7B-8k | 45.5 | 34.0 | -| ChatGLM2-6B | 48.4 | 43.7 | -| ChatGLM2-6B-32k | 55.4 | 50.0 | - -#### Synthetic Tasks - -| | PassageRetrieval-en | Passage Count | PassageRetrieval-zh | -| ----------------- | :-----------------: | :-----------: | :-----------------: | -| GPT-3.5-Turbo-16k | 71.0 | 4.5 | 77.5 | -| Llama2-7B-chat-4k | 9.2 | 2.5 | 0.5 | -| LongChat-7B-16k | 24.0 | 4.5 | 4.8 | -| XGen-7B-8k | 9.0 | 2.2 | 4.2 | -| InternLM-7B-8k | 6.5 | 2.9 | 0.9 | -| ChatGLM2-6B | 3.2 | 2.1 | 5.5 | -| ChatGLM2-6B-32k | 77.5 | 2.0 | 62.5 | - -## Acknowledgements - -- Some of the tasks of **LongBench** are based on the datasets proposed by previous researchers, including [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), [DuReader](https://github.com/baidu/DuReader), [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), [QMSum](https://arxiv.org/pdf/2104.05938.pdf), [VCSUM](https://arxiv.org/abs/2305.05280), [TriviaQA](https://nlp.cs.washington.edu/triviaqa/), [NQ](https://ai.google.com/research/NaturalQuestions/), [TREC](https://aclanthology.org/C02-1150.pdf), [LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf), [LCC](https://arxiv.org/abs/2306.14893) and [RepoBench-P](https://arxiv.org/abs/2306.03091). 
- -## Citation -This work is jointly completed by **THUKEG** and **Zhipu AI**. The related paper is currently being written, and the citation information will be updated when it's ready. Please stay tuned~ - -If you use this benchmark, you can also cite the papers corresponding to the datasets that LongBench is based on. The relevant citation information is listed [here](refs/ref.bib). \ No newline at end of file diff --git a/README_ZH.md b/README_ZH.md new file mode 100644 index 0000000..3eff75c --- /dev/null +++ b/README_ZH.md @@ -0,0 +1,175 @@ + +<p align="center"> + 🤗 <a href="https://huggingface.co/THUDM/chatglm-6b" target="_blank">HF Repo</a> • 📃 Paper coming soon! +</p> + +Read this in [English](README.md). + +# LongBench: 多è¯è¨€ã€å¤šä»»åŠ¡é•¿æ–‡æœ¬ç†è§£è¯„测基准 + +**LongBench**是第一个多è¯è¨€ã€å¤šä»»åŠ¡ã€å…¨é¢è¯„估大è¯è¨€æ¨¡åž‹**长文本ç†è§£**能力的综åˆæ•°æ®é›†ã€‚在目å‰å¤§æ¨¡åž‹å¤šè¯è¨€èƒ½åŠ›å¼•èµ·å¹¿æ³›å…³æ³¨çš„背景下,LongBench涵盖了ä¸åŒçš„è¯è¨€ï¼ˆä¸æ–‡å’Œè‹±æ–‡ï¼‰ï¼Œä»¥æ¤æ¥å¯¹å¤§æ¨¡åž‹åœ¨é•¿æ–‡æœ¬ä¸‹çš„多è¯è¨€èƒ½åŠ›è¿›è¡Œæ›´å…¨é¢çš„评估。åŒæ—¶ï¼ŒLongBench由二å个ä¸åŒçš„任务组æˆï¼Œè¦†ç›–了å•æ–‡æ¡£QAã€å¤šæ–‡æ¡£QAã€æ‘˜è¦ã€Few-shotå¦ä¹ ã€ä»£ç 补全和åˆæˆä»»åŠ¡ç‰å…³é”®çš„长文本应用场景。 + +我们深知模型评测过程ä¸å¯èƒ½äº§ç”Ÿçš„高昂æˆæœ¬ï¼Œå°¤å…¶æ˜¯é•¿æ–‡æœ¬åœºæ™¯ä¸‹ï¼ˆå¦‚äººå·¥æ ‡æ³¨æˆæœ¬æˆ–API调用æˆæœ¬ï¼‰ã€‚å› æ¤ï¼Œæˆ‘们采用了一ç§å…¨è‡ªåŠ¨çš„评测方å¼ï¼Œæ—¨åœ¨ä»¥æœ€ä½Žçš„æˆæœ¬ï¼Œæœ€æœ‰æ•ˆåœ°è¡¡é‡å’Œè¯„估模型的长文本ç†è§£èƒ½åŠ›ã€‚ + +LongBench包å«13个英文任务ã€5个ä¸æ–‡ä»»åŠ¡å’Œ2个代ç 任务,多数任务的平å‡é•¿åº¦åœ¨5k-15k之间,共包å«çº¦4500æ¡æµ‹è¯•æ•°æ®ã€‚从主è¦ä»»åŠ¡åˆ†ç±»ä¸Šï¼ŒLongBench包å«å…大类任务,å³å•æ–‡æ¡£QAã€å¤šæ–‡æ¡£QAã€æ‘˜è¦ã€Few-shotå¦ä¹ ã€ä»£ç 补全和åˆæˆä»»åŠ¡ã€‚关于LongBenchæ•°æ®é›†çš„具体统计åŠä»»åŠ¡æž„é€ æ–¹å¼è¯·å‚考[这里](task_zh.md)。 + +| 任务类型 | 英文任务数 | ä¸æ–‡ä»»åŠ¡æ•° | 代ç 任务数 | +| :----------: | :--------: | :--------: | :--------: | +| å•æ–‡æ¡£QA | 3 | 1 | - | +| 多文档QA | 3 | 1 | - | +| æ‘˜è¦ | 2 | 1 | - | +| Few-shotå¦ä¹ | 3 | 1 | - | +| åˆæˆä»»åŠ¡ | 2 | 1 | - | +| 代ç 补全 | - | - | 2 | + +## 排行榜 +我们在这里展示了所有模型在Zero-shot场景下,在ä¸æ–‡å’Œè‹±æ–‡å„大类任务上得分的平å‡å€¼ï¼ˆ%),å„ä»»åŠ¡è¯„ä¼°æ‰€ç”¨æŒ‡æ ‡è¯·å‚考[这里](task_zh.md)。 +#### è‹±æ–‡æ¦œå• +| | Avg | å•æ–‡æ¡£QA | 多文档QA | æ‘˜è¦ | Few-shotå¦ä¹ | 代ç 补全 | åˆæˆä»»åŠ¡ | +| --- | :-: | :-: | :-: | :-: | :-: | :-: | :-: | +| GPT-3.5-Turbo-16k | 45.5 | 39.8 | 38.7 | 26.5 | 76.0 | 54.5 | 37.8 | +| Llama2-7B-chat-4k | 29.0 | 24.8 | 21.4 | 23.9 | 50.5 | 47.3 | 5.9 | +| LongChat-7B-16k | 33.7 | 29.3 | 16.1 | 25.8 | 59.9 | 57.0 | 14.2 | +| XGen-7B-8k | 28.7 | 24.5 | 20.4 | 24.8 | 58.7 | 38.0 | 5.6 | +| InternLM-7B-8k | 24.7 | 17.1 | 20.8 | 13.3 | 52.7 | 39.7 | 4.7 | +| ChatGLM2-6B | 26.0 | 23.1 | 15.0 | 22.9 | 46.1 | 46.1 | 2.7 | +| ChatGLM2-6B-32k | 42.7 | 32.8 | 34.0 | 28.6 | 68.1 | 52.7 | 39.8 | + +#### ä¸æ–‡æ¦œå• +| | Avg | å•æ–‡æ¡£QA | 多文档QA | æ‘˜è¦ | Few-shotå¦ä¹ | 代ç 补全 | åˆæˆä»»åŠ¡ | +|-------|:---:|:-------------:|:------------:|:-------------:|:-----------------:|:---------------:|:----------------:| +| GPT-3.5-Turbo-16k | 44.5 | 61.2 | 28.7 | 16.0 | 29.2 | 54.5 | 77.5 | +| Llama2-7B-chat-4k | 13.5 | 11.6 | 1.9 | 0.2 | 19.8 | 47.3 | 0.5 | +| LongChat-7B-16k | 23.7 | 26.6 | 19.1 | 14.0 | 20.8 | 57.0 | 4.8 | +| XGen-7B-8k | 14.5 | 14.2 | 9.1 | 1.5 | 20.0 | 38.0 | 4.2 | +| InternLM-7B-8k | 18.6 | 33.3 | 8.9 | 13.0 | 15.5 | 39.7 | 0.9 | +| ChatGLM2-6B | 22.5 | 33.0 | 15.2 | 14.6 | 20.5 | 46.1 | 5.5 | +| ChatGLM2-6B-32k | 41.3 | 52.0 | 34.3 | 16.3 | 29.9 | 52.7 | 62.5 | + +#### 长文本任务能力雷达图 + + +#### ä¸åŒé•¿åº¦æ–‡æœ¬ä¸‹çš„能力å˜åŒ– +为了更有针对性地分æžæ¨¡åž‹åœ¨ä¸åŒæ–‡æœ¬é•¿åº¦ä¸‹çš„相对表现,下图展示了模型在ä¸åŒæ–‡æœ¬é•¿åº¦åŒºé—´ä¸Šï¼Œæ‰€æœ‰ä»»åŠ¡ä¸Šçš„å¹³å‡ç›¸å¯¹åˆ†æ•°ã€‚ + + +> 
注:å‡è®¾æ¨¡åž‹åœ¨æŸä¸ªä»»åŠ¡çš„特定长度范围内数æ®ä¸Šå¾—分为x,在该任务所有数æ®ä¸Šå¾—分为y,则模型在该长度范围的**相对分数**为(x/y-1)。为了更好比较ä¸åŒæ¨¡åž‹çš„å˜åŒ–趋势,我们在0-4k将所有折线平移至0。 + +## 如何在LongBench上评测模型 + +#### è½½å…¥æ•°æ® +ä½ å¯ä»¥é€šè¿‡Hugging Face datasetsæ¥ä¸‹è½½å¹¶è½½å…¥**LongBench**çš„æ•°æ®ï¼ˆ[🤗 HF Repo](https://huggingface.co/datasets/THUDM/LongBench)): +```python +from datasets import load_dataset + +datasets = ["hotpotqa", "2wikimqa", "musique", "dureader", "narrativeqa", "qasper", "multifieldqa_en", \ + "multifieldqa_zh", "gov_report", "qmsum", "vcsum", "trec", "nq", "triviaqa", "lsht", "passage_count", \ + "passage_retrieval_en", "passage_retrieval_zh", "lcc", "repobench-p"] + +for dataset in datasets: + data = load_dataset('THUDM/LongBench', dataset, split='test') +``` +åŒæ ·åœ°ï¼Œä½ 也å¯ä»¥ç›´æŽ¥ç”¨è¿™ä¸ª[链接](https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip)下载所有的评测数æ®ã€‚ + +#### æ•°æ®æ ¼å¼ +**LongBench**ä¸æ‰€æœ‰æ•°æ®éƒ½ç»Ÿä¸€ä¸ºä»¥ä¸‹æ ¼å¼ï¼š +```json +{ + "input": "任务的输入/指令,通常较çŸï¼Œæ¯”如QAä¸çš„问题ã€Few-shot任务ä¸çš„æé—®ç‰", + "context": "任务所需的长è¯å¢ƒæ–‡æœ¬ï¼Œæ¯”如文档ã€è·¨æ–‡ä»¶ä»£ç ã€Few-shot任务ä¸çš„few-shotæ ·æœ¬", + "answers": "ç”±æ‰€æœ‰æ ‡å‡†ç”案组æˆçš„列表", + "length": "å‰ä¸‰é¡¹æ–‡æœ¬çš„总长度(ä¸ã€è‹±æ–‡åˆ†åˆ«ç”¨å—ã€è¯æ•°ç»Ÿè®¡ï¼‰", + "dataset": "本æ¡æ•°æ®æ‰€å±žæ•°æ®é›†å称", + "language": "本æ¡æ•°æ®çš„è¯è¨€", + "all_classes": "分类任务ä¸çš„所有类别,éžåˆ†ç±»ä»»åŠ¡åˆ™ä¸ºnull", + "_id": "æ¯æ¡æ•°æ®çš„éšæœºid" +} +``` + +#### 评测 +我们以ChatGLM2-6B为例æ供了一份评测代ç 。首先,è¿è¡Œä»“库下的[pred.py](pred.py) +```bash +CUDA_VISIBLE_DEVICES=0 python pred.py +``` +å¯ä»¥åœ¨`pred/`文件夹下得到模型在所有数æ®é›†ä¸‹çš„输出,æ¤åŽè¿è¡Œ[eval.py](eval.py)的评测代ç : +```bash +python eval.py +``` +å¯ä»¥åœ¨`result.json`ä¸å¾—到在å„æ•°æ®é›†ä¸Šçš„评测结果。请注æ„,我们在`config/`下æ供了我们总结出æ¥çš„在å„æ•°æ®é›†ä¸Šé€‚åˆçš„è¾“å…¥æ ¼å¼å’Œæœ€å¤§è¾“出长度é™åˆ¶ï¼Œåœ¨è¯„测的时候å¯ä»¥è¿›è¡Œä¿®æ”¹ä»¥æ›´å¥½åœ°é€‚ç”¨ä½ è¦è¯„测的模型,修改åŽåœ¨[pred.py](pred.py)è¯„æµ‹æ—¶ä¼šè‡ªåŠ¨æŒ‰ç…§æ–°çš„æ ¼å¼åŽ»æ•´ç†æ•°æ®å¹¶å¾—到对应的模型输出。 + +## 详细评测结果 +下é¢çš„å‡ å¼ è¡¨æ ¼å±•ç¤ºäº†æ¨¡åž‹åœ¨æ‰€æœ‰å任务数æ®é›†ä¸Šçš„Zero-shot评测结果(%),其ä¸çš„ä¸æ–‡æ•°æ®é›†ä»¥â€œzhâ€æ ‡ç¤ºï¼ˆå„ä»»åŠ¡è¯„ä¼°æ‰€ç”¨æŒ‡æ ‡è¯·å‚考[这里](task_zh.md))。 +> 注:对于超出模型处ç†é•¿åº¦èƒ½åŠ›çš„文本,å‚考[Lost in the Middle](https://arxiv.org/abs/2307.03172)的观察,我们从文本ä¸é—´è¿›è¡Œæˆªæ–,ä¿æŒå‰åŽéƒ¨åˆ†çš„ä¿¡æ¯ã€‚实验表明,这ç§æˆªæ–æ–¹å¼å¯¹æ¨¡åž‹æ€§èƒ½å½±å“最å°ã€‚ +#### å•æ–‡æ¡£QA +| | NarrativeQA | Qasper | MultiFieldQA-en | MultiFieldQA-zh | +|-------------------|:-----------:|:------:|:---------------:|:---------------:| +| GPT-3.5-Turbo-16k | 23.6 | 43.3 | 52.3 | 61.2 | +| Llama2-7B-chat-4k | 19.1 | 19.6 | 35.8 | 11.6 | +| LongChat-7B-16k | 21.6 | 21.6 | 44.6 | 26.6 | +| XGen-7B-8k | 17.9 | 18.3 | 37.2 | 14.2 | +| InternLM-7B-8k | 12.4 | 16.8 | 22.3 | 33.3 | +| ChatGLM2-6B | 11.2 | 23.7 | 34.2 | 33.0 | +| ChatGLM2-6B-32k | 20.4 | 32.2 | 45.7 | 52.0 | + +#### 多文档QA +| | HotpotQA | 2WikiMQA | Musique | DuReader (zh) | +|----------------------|:--------:|:--------:|:-------:|:--------:| +| GPT-3.5-Turbo-16k | 51.6 | 37.7 | 26.9 | 28.7 | +| Llama2-7B-chat-4k | 24.3 | 31.4 | 8.6 | 1.9 | +| LongChat-7B-16k | 22.4 | 16.8 | 9.1 | 19.1 | +| XGen-7B-8k | 28.3 | 21.5 | 11.5 | 9.1 | +| InternLM-7B-8k | 27.9 | 24.0 | 10.3 | 8.9 | +| ChatGLM2-6B | 20.2 | 19.6 | 5.3 | 15.2 | +| ChatGLM2-6B-32k | 44.9 | 34.9 | 22.2 | 34.3 | + +#### æ‘˜è¦ +| | GovReport | QMSum | VCSUM (zh) | +|:-----------|:---------:|:-----:|:-----:| +| GPT-3.5-Turbo-16k | 29.5 | 23.4 | 16.0 | +| Llama2-7B-chat-4k | 27.3 | 20.6 | 0.2 | +| LongChat-7B-16k | 28.4 | 23.2 | 14.0 | +| XGen-7B-8k | 27.8 | 21.7 | 1.5 | +| InternLM-7B-8k | 9.8 | 16.8 | 
13.0 | +| ChatGLM2-6B | 23.7 | 22.2 | 14.6 | +| ChatGLM2-6B-32k | 33.3 | 23.9 | 16.3 | + +#### Few-shotå¦ä¹ +| | TREC | NQ | TriviaQA | LSHT (zh) | +| --- | :-: | :-: | :-: | :-: | +| GPT-3.5-Turbo-16k | 68.0 | 73.0 | 87.1 | 29.2 | +| Llama2-7B-chat-4k | 60.5 | 31.4 | 59.7 | 19.8 | +| LongChat-7B-16k | 61.5 | 44.8 | 73.5 | 20.8 | +| XGen-7B-8k | 66.0 | 43.2 | 67.0 | 20.0 | +| InternLM-7B-8k | 49.0 | 47.6 | 61.6 | 15.5 | +| ChatGLM2-6B | 44.0 | 34.5 | 59.8 | 20.5 | +| ChatGLM2-6B-32k | 62.0 | 64.9 | 77.6 | 29.9 | + +#### 代ç 补全 +| | LCC | RepoBench-P | +| --- | :-: | :-: | +| GPT-3.5-Turbo-16k | 54.7 | 54.3 | +| Llama2-7B-chat-4k | 52.3 | 42.4 | +| LongChat-7B-16k | 59.2 | 54.7 | +| XGen-7B-8k | 38.8 | 37.3 | +| InternLM-7B-8k | 45.5 | 34.0 | +| ChatGLM2-6B | 48.4 | 43.7 | +| ChatGLM2-6B-32k | 55.4 | 50.0 | + +#### åˆæˆä»»åŠ¡ +| | PassageRetrieval-en | Passage Count | PassageRetrieval-zh | +| --- | :-: | :-: | :-: | +| GPT-3.5-Turbo-16k | 71.0 | 4.5 | 77.5 | +| Llama2-7B-chat-4k | 9.2 | 2.5 | 0.5 | +| LongChat-7B-16k | 24.0 | 4.5 | 4.8 | +| XGen-7B-8k | 9.0 | 2.2 | 4.2 | +| InternLM-7B-8k | 6.5 | 2.9 | 0.9 | +| ChatGLM2-6B | 3.2 | 2.1 | 5.5 | +| ChatGLM2-6B-32k | 77.5 | 2.0 | 62.5 | + +## 致谢 +- **LongBench**的部分任务基于之å‰çš„ç ”ç©¶è€…æ出的数æ®é›†æž„建,包括[HotpotQA](https://hotpotqa.github.io/),[2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/),[Musique](https://arxiv.org/abs/2108.00573),[DuReader](https://github.com/baidu/DuReader),[NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf),[Qasper](https://arxiv.org/pdf/2105.03011.pdf),[GovReport](https://arxiv.org/pdf/2104.02112.pdf),[QMSum](https://arxiv.org/pdf/2104.05938.pdf),[VCSUM](https://arxiv.org/abs/2305.05280),[TriviaQA](https://nlp.cs.washington.edu/triviaqa/),[NQ](https://ai.google.com/research/NaturalQuestions/),[TREC](https://aclanthology.org/C02-1150.pdf),[LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf),[LCC](https://arxiv.org/abs/2306.14893)å’Œ[RepoBench-P](https://arxiv.org/abs/2306.03091)。 + +## 引用 +本工作由**THUKEG**å’Œ**Zhipu AI**å…±åŒå®Œæˆï¼Œç›¸å…³è®ºæ–‡æ£åœ¨æ’°å†™ä¸ï¼Œå±Šæ—¶å°†æ›´æ–°å¼•ç”¨ä¿¡æ¯ï¼Œæ•¬è¯·å…³æ³¨~ + +如果您使用该Benchmark,å¯ä»¥ä¸€å¹¶å¼•ç”¨LongBench所基于的数æ®é›†å¯¹åº”的论文,相关引用信æ¯åˆ—举在[这里](refs/ref.bib)。 diff --git a/task.md b/task.md index 67a6a0d..186164f 100644 --- a/task.md +++ b/task.md @@ -1,66 +1,67 @@ -# 任务统计 +# Task statistics -| 任务 | 任务类型 | è¯„ä»·æŒ‡æ ‡ | å¹³å‡é•¿åº¦ |è¯è¨€ | Sampleæ•°é‡| -| :--------- | :-----------:| :-----------: |:---------: | :-------------: |:---------: | -| HotpotQA | 多文档QA | F1 |9,149 |英文 |200 | -| 2WikiMultihopQA| 多文档QA | F1 |4,885 |英文 |200 | -| Musique| 多文档QA | F1 |7,798 |英文 |200 | -| DuReader| 多文档QA | Rouge-L |15,768 |ä¸æ–‡ |200 | -| MultiFieldQA-en| å•æ–‡æ¡£QA | F1 |4,559 |英文 |150 | -| MultiFieldQA-zh| å•æ–‡æ¡£QA | F1 |6,771 |ä¸æ–‡ |200 | -| NarrativeQA| å•æ–‡æ¡£QA | F1 |18,405 |英文 |200 | -| Qasper| å•æ–‡æ¡£QA | F1 |3,619 |英文 |200 | -| GovReport| æ‘˜è¦ | Rouge-L |8,169 |英文 |200 | -| QMSum| æ‘˜è¦ | Rouge-L |10,546 |英文 |200 | -| VCSUM| æ‘˜è¦ | Rouge-L |15,147 |ä¸æ–‡ |200 | -| TriviaQA| Few shot | F1 |8,015 |英文 |200 | -| NQ| Few shot | F1 |8,210 |英文 |200 | -| TREC| Few shot | Accuracy |5,176 |英文 |200 | -| LSHT| Few shot | Accuracy |22,333 |ä¸æ–‡ |200 | -| PassageRetrieval-en| åˆæˆä»»åŠ¡ | Accuracy |9,288 |英文 |200 | -| PassageCount| åˆæˆä»»åŠ¡ | Accuracy |11,141 |英文 |200 | -| PassageRetrieval-zh | åˆæˆä»»åŠ¡ | Accuracy |6,745 |ä¸æ–‡ |200 | -| LCC| 代ç | Edit Sim |1,235 |Python/C#/Java |500 | -| RepoBench-P| 代ç | Edit Sim |5,622 |Python/Java |500 | +| Task | Task Type | Eval 
metric | Avg len |Language | \#Sample | +| --------- | -------------| ------------- |--------- | ------------- |--------- | +| HotpotQA | Multi-doc QA | F1 |9149 |EN |200 | +| 2WikiMultihopQA| Multi-doc QA | F1 |4885 |EN |200 | +| Musique| Multi-doc QA | F1 |7798 |EN |200 | +| DuReader| Multi-doc QA | Rouge-L |15768 |ZH |200 | +| MultiFieldQA-en| Single-doc QA | F1 |4559 |EN |150 | +| MultiFieldQA-zh| Single-doc QA | F1 |6771 |ZH |200 | +| NarrativeQA| Single-doc QA | F1 |18405 |EN |200 | +| Qasper| Single-doc QA | F1 |3619 |EN |200 | +| GovReport| Summarization | Rouge-L |8169 |EN |200 | +| QMSum| Summarization | Rouge-L |10546 |EN |200 | +| VCSUM| Summarization | Rouge-L |15147 |ZH |200 | +| TriviaQA| Few shot | F1 |8015 |EN |200 | +| NQ| Few shot | F1 |8210 |EN |200 | +| TREC| Few shot | Accuracy |5176 |EN |200 | +| LSHT| Few shot | Accuracy |22333 |ZH |200 | +| PassageRetrieval-en| Synthetic | Accuracy |9288 |EN |200 | +| PassageCount| Synthetic | Accuracy |11141 |EN |200 | +| PassageRetrieval-zh | Synthetic | Accuracy |6745 |ZH |200 | +| LCC| Code | Edit Sim |1235 |Python/C#/Java |500 | +| RepoBench-P| Code | Edit Sim |5622 |Python/Java |500 | -> 注:为了é¿å…ä¸åŒTokenizer统计的差è·ï¼Œæˆ‘们使用å•è¯æ•°ï¼ˆPythonçš„split函数)æ¥ç»Ÿè®¡è‹±æ–‡æ•°æ®é›†å’Œä»£ç æ•°æ®é›†çš„å¹³å‡é•¿åº¦ï¼Œä½¿ç”¨æ±‰å—æ•°æ¥ç»Ÿè®¡ä¸æ–‡æ•°æ®é›†çš„å¹³å‡é•¿åº¦ã€‚ +> Note: In order to avoid discrepancies caused by different tokenizers, we use the word count (using Python's split function) to calculate the average length of English datasets and code datasets, and use the character count to calculate the average length of Chinese datasets. -# 任务说明 +# Task description -| 任务 | 任务说明 | -| :----------------- | :----------------------------------------------------------- | -| HotpotQA | 基于多篇给定的文档,回ç”相关问题 | -| 2WikiMultihopQA | 基于多篇给定的文档,回ç”相关问题 | -| Musique | 基于多篇给定的文档,回ç”相关问题 | -| DuReader | 基于多篇给定的检索文档,回ç”相关的ä¸æ–‡é—®é¢˜ | -| MultiFieldQA-en | 基于å•ç¯‡æ–‡æ¡£ï¼Œå›žç”英文问题,文档所属的领域相对多元 | -| MultiFieldQA-zh | 基于å•ç¯‡æ–‡æ¡£ï¼Œå›žç”ä¸æ–‡é—®é¢˜ï¼Œæ–‡æ¡£æ‰€å±žçš„领域相对多元 | -| NarrativeQA | 基于故事或剧本æ问,包括对人物ã€æƒ…节ã€ä¸»é¢˜ç‰é‡è¦å…ƒç´ çš„ç†è§£ | -| Qasper | 基于å•ç¯‡è®ºæ–‡çš„æ出,问题由NLP的读者æ出,并由NLPä»Žä¸šè€…å›žç” | -| GovReport | 摘è¦ä»»åŠ¡ï¼Œè¦æ±‚å¯¹æ”¿åºœçš„å·¥ä½œæŠ¥å‘Šè¿›è¡Œæ€»ç»“æ‘˜è¦ | -| QMSum | 摘è¦ä»»åŠ¡ï¼Œè¦æ±‚åŸºäºŽç”¨æˆ·çš„æŸ¥è¯¢å¯¹ä¼šè®®è®°å½•è¿›è¡Œæ‘˜è¦ | -| VCSUM | 摘è¦ä»»åŠ¡ï¼Œè¦æ±‚对ä¸æ–‡ä¼šè®®è®°å½•è¿›è¡Œæ€»ç»“æ‘˜è¦ | -| TriviaQA | å•æ–‡æ¡£é—®ç”任务,æ供若干的Few Shotæ ·ä¾‹ | -| NQ | å•æ–‡æ¡£é—®ç”任务,æ供若干的Few Shotæ ·ä¾‹ | -| TREC | 分类任务,è¦æ±‚对问题进行分类,一共包å«50个类别 | -| LSHT | ä¸æ–‡åˆ†ç±»ä»»åŠ¡ï¼Œè¦æ±‚对新闻进行分类,一共包å«24个类别 | -| PassageRetrieval-en | 给定30个英文维基的段è½ï¼Œåˆ¤æ–给定的摘è¦å±žäºŽå“ªä¸ªæ®µè½ | -| PassageCount | 判æ–给定的若干的段è½ä¸ä¸é‡å¤çš„段è½ä¸€å…±æœ‰å‡ 个 | -| PassageRetrieval-zh | 给定若干个出自C4æ•°æ®é›†çš„ä¸æ–‡æ®µè½ï¼Œåˆ¤æ–给定的摘è¦å±žäºŽå“ªä¸ªæ®µè½ | -| LCC | 给定一段较长代ç ,è¦æ±‚预测出下一行代ç | -| RepoBench-P | 给定一个github仓库内多个文件ä¸çš„代ç (包å«æ–‡ä»¶é—´ä¾èµ–),è¦æ±‚预测出下一行代ç | +| Task | Task Description | +| ----------------- | ------------------------------------------------------------ | +| HotpotQA | Answer related questions based on multiple given documents | +| 2WikiMultihopQA | Answer related questions based on multiple given documents | +| Musique | Answer related questions based on multiple given documents | +| DuReader | Answer related Chinese questions based on multiple retrieved documents | +| MultiFieldQA-en | Answer English questions based on a single document, which comes from a relatively diverse field | +| MultiFieldQA-zh | Answer Chinese questions based on a single document, 
which comes from a relatively diverse field | +| NarrativeQA | Ask questions based on stories or scripts, including understanding of important elements such as characters, plots, themes, etc. | +| Qasper | Ask questions based on a single paper, questions proposed by NLP readers, and answered by NLP practitioners | +| GovReport | A summarization task that requires summarizing government work reports | +| QMSum | A summarization task that requires summarizing meeting records based on user queries | +| VCSUM | A summarization task that requires summarizing Chinese meeting records | +| TriviaQA | Single document question answering task, providing several Few Shot examples | +| NQ | Single document question answering task, providing several Few Shot examples | +| TREC | A classification task that requires categorizing questions, includes 50 categories in total | +| LSHT | A Chinese classification task that requires categorizing news, includes 24 categories in total | +| PassageRetrieval-en | Given 30 English Wikipedia paragraphs, determine which paragraph the given summary belongs to | +| PassageCount | Determine the number of non-repeating paragraphs in a given number of paragraphs | +| PassageRetrieval-zh | Given several Chinese paragraphs from the C4 data set, determine which paragraph the given abstract belongs to | +| LCC | Given a longer piece of code, predict the next line of code | +| RepoBench-P | Given code in multiple files within a GitHub repository (including inter-file dependencies), predict the next line of code | -# æ•°æ®æž„é€ æ–¹å¼ -> 注:对于所有基于已有数æ®é›†æž„é€ çš„ä»»åŠ¡ï¼Œæˆ‘ä»¬å‡é€‰ç”¨åŽŸæœ‰æ•°æ®é›†çš„验è¯é›†æˆ–测试集的数æ®ï¼ˆVCSUM任务除外) +# Task construction -- [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573)å’Œ[DuReader](https://github.com/baidu/DuReader)任务基于原始的数æ®é›†æž„建,并进行相关处ç†ä½¿å…¶é€‚用于长文本评测。具体地,对于验è¯é›†ä¸çš„问题,我们会选å–包å«ç”案的evidence passageå’Œè‹¥å¹²å¹²æ‰°çš„æ–‡ç« ï¼Œè¿™äº›æ–‡ç« å’ŒåŽŸå§‹çš„é—®é¢˜å…±åŒç»„æˆäº†ç›¸å…³ä»»åŠ¡çš„输入。 -- MultiFiedQA-zhå’ŒMultiFieldQA-en任务由约10ç§æ¥æºçš„长文本数æ®ç»„æˆï¼ŒåŒ…å«Latex论文ã€è£åˆ¤æ–‡ä¹¦ã€æ”¿åºœå·¥ä½œæŠ¥å‘Šå’Œè°·æŒç´¢å¼•çš„PDF文档ç‰ã€‚对于æ¯ç¯‡é•¿æ–‡æœ¬ï¼Œæˆ‘们邀请了若干åšå£«ç”Ÿå’Œç¡•å£«ç”Ÿæ¥è¿›è¡Œæ ‡æ³¨ï¼Œå³åŸºäºŽé•¿æ–‡æœ¬æ问,并给出æ£ç¡®çš„ç”案。为了更好地进行自动化评测,我们è¦æ±‚æ ‡æ³¨å‘˜å°½å¯èƒ½æ出有确定性ç”案的问题。 -- [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf)å’Œ[QMSum](https://arxiv.org/pdf/2104.05938.pdf)任务直接使用原论文æ供的数æ®ã€‚在具体的构建ä¸ï¼Œæˆ‘们使用[ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/)æ供的模æ¿æ¥å°†å¯¹åº”çš„æ•°æ®è½¬æ¢ä¸ºçº¯æ–‡æœ¬çš„输入。 -- [VCSUM](https://arxiv.org/abs/2305.05280)任务基于原始的数æ®é›†æž„建,我们针对该数æ®è®¾è®¡äº†ç›¸åº”的模æ¿å°†å¯¹åº”çš„æ•°æ®è½¬æ¢ä¸ºçº¯æ–‡æœ¬çš„输入。 -- [TriviaQA](https://nlp.cs.washington.edu/triviaqa/)å’Œ[NQ](https://ai.google.com/research/NaturalQuestions/)任务å‚考[CoLT5](https://arxiv.org/abs/2303.09752)çš„æ–¹å¼è¿›è¡Œæž„建,å³ä¼šæ供若干基于文档进行问ç”çš„æ ·ä¾‹ï¼Œå¹¶è¦æ±‚è¯è¨€æ¨¡åž‹åŸºäºŽæ–°çš„文档回ç”相关问题。 -- [TREC](https://aclanthology.org/C02-1150.pdf)å’Œ[LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf)任务基于原始的数æ®é›†æž„建。对于验è¯é›†ä¸çš„æ¯ä¸ªé—®é¢˜ï¼Œæˆ‘ä»¬é‡‡æ ·è®ç»ƒé›†ä¸çš„若干数æ®ç»„æˆFew-shotæ ·ä¾‹ã€‚è¿™äº›æ ·ä¾‹ä¼šå’ŒéªŒè¯é›†ä¸çš„问题共åŒç»„æˆè¯¥ä»»åŠ¡çš„输入。 -- PassageRetrieval-enä»»åŠ¡åŸºäºŽè‹±æ–‡ç»´åŸºè¿›è¡Œæž„é€ ã€‚å¯¹äºŽæ¯æ¡æ•°æ®ï¼Œæˆ‘们éšæœºé‡‡æ ·30段英文维基的段è½ï¼Œå¹¶é€‰å–å…¶ä¸ä¸€æ®µè¿›è¡Œæ‘˜è¦ï¼ˆä½¿ç”¨GPT-3.5-Turbo)。该任务è¦æ±‚模型给出摘è¦åº”该对应哪个的原始段è½ã€‚ -- 
PassageCountä»»åŠ¡åŸºäºŽè‹±æ–‡ç»´åŸºè¿›è¡Œæž„é€ ã€‚å¯¹äºŽæ¯æ¡æ•°æ®ï¼Œæˆ‘们éšæœºé‡‡æ ·è‹¥å¹²è‹±æ–‡ç»´åŸºçš„段è½ï¼Œå¹¶å°†å…¶ä¸çš„æ¯ä¸ªæ®µè½éšæœºé‡å¤è‹¥å¹²æ¬¡ï¼Œæœ€åŽå°†æ®µè½éšæœºæ‰“乱。该任务è¦æ±‚模型判æ–给定的若干的段è½ä¸ä¸é‡å¤çš„段è½ä¸€å…±æœ‰å‡ 个。 -- PassageRetrieval-zh任务基于[C4](https://arxiv.org/abs/1910.10683)è¿›è¡Œæž„é€ ã€‚å¯¹äºŽæ¯æ¡æ•°æ®ï¼Œæˆ‘们éšæœºé‡‡æ ·è‹¥å¹²æ®µæ¥è‡ªäºŽC4çš„ä¸æ–‡æ®µè½ï¼Œå¹¶é€‰å–å…¶ä¸ä¸€æ®µè¿›è¡Œæ‘˜è¦ï¼ˆä½¿ç”¨GPT-3.5-Turbo)。该任务è¦æ±‚模型给出摘è¦åº”该对应哪个的原始段è½ã€‚ -- [LCC](https://arxiv.org/abs/2306.14893)任务我们基于原始的代ç 补全数æ®é›†é‡‡æ ·æž„建。[RepoBench-P](https://arxiv.org/abs/2306.03091)任务ä¸æˆ‘们选å–了原数æ®é›†æœ€å…·æŒ‘战性的XF-F(Cross-File-First)设定,并且å‚考原文ä¸çš„Oracle-Filled场景,对于æ¯ä¸€æ¡åŽŸå§‹æ•°æ®æˆ‘们éšæœºæŠ½å–包括有效跨文件代ç 片段(gold snippet)在内的多个跨文件代ç 片段,将其拼接åŽä½œä¸ºè¾“入,è¦æ±‚模型从其ä¸åˆ©ç”¨æœ‰æ•ˆçš„跨文件代ç 以补全当å‰æ–‡ä»¶ä¸çš„代ç 。 \ No newline at end of file +> Note: For all tasks constructed from existing datasets, we use data from the validation or test set of the existing dataset (except for VCSUM). + +- The tasks of [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) are built based on the original datasets and processed to make them suitable for long text evaluation. Specifically, for questions in the validation set, we select the evidence passage that contains the answer and several distracting articles. These articles together with the original question constitute the input of the related tasks. +- The tasks of MultiFiedQA-zh and MultiFieldQA-en consist of long-text data from about 10 sources, including Latex papers, judicial documents, government work reports, and PDF documents indexed by Google. For each long text, we invite several PhD and master students to annotate, i.e., to ask questions based on the long text and give the correct answers. To better automate evaluation, we require the annotators to ask questions with definitive answers as much as possible. +- The tasks of [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), and [QMSum](https://arxiv.org/pdf/2104.05938.pdf) directly use the data provided by the original papers. In the specific construction, we use the template provided by [ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/) to convert the corresponding data into pure text input. +- The [VCSUM](https://arxiv.org/abs/2305.05280) task is built based on the original dataset, and we have designed a corresponding template to convert the corresponding data into pure text input. +- The tasks of [TriviaQA](https://nlp.cs.washington.edu/triviaqa/) and [NQ](https://ai.google.com/research/NaturalQuestions/) are constructed in the manner of [CoLT5](https://arxiv.org/abs/2303.09752), which provides several examples of question and answering based on documents, and requires the language model to answer related questions based on new documents. +- The tasks of [TREC](https://aclanthology.org/C02-1150.pdf) and [LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf) are built based on the original datasets. For each question in the validation set, we sample several data from the training set to form few-shot examples. These examples together with the questions in the validation set constitute the input for this task. +- The PassageRetrieval-en task is constructed based on English Wikipedia. 
For each piece of data, we randomly sample 30 paragraphs from English Wikipedia and select one of them for summarization (using GPT-3.5-Turbo). The task requires the model to specify which original paragraph the summary corresponds to.
+- The PassageCount task is constructed based on English Wikipedia. For each piece of data, we randomly sample several passages from English Wikipedia, repeat each paragraph a random number of times, and finally shuffle the paragraphs. This task requires the model to determine how many unique paragraphs there are among the given paragraphs.
+- The PassageRetrieval-zh task is constructed based on [C4](https://arxiv.org/abs/1910.10683). For each piece of data, we randomly sample several Chinese paragraphs from C4 and select one of them for summarization (using GPT-3.5-Turbo). This task requires the model to identify which original paragraph the summary corresponds to.
+- For the [LCC](https://arxiv.org/abs/2306.14893) task, we sample from the original code completion dataset. In the [RepoBench-P](https://arxiv.org/abs/2306.03091) task, we select the most challenging XF-F (Cross-File-First) setting from the original dataset and follow the Oracle-Filled scenario in the original paper. For each original piece of data, we randomly extract multiple cross-file code snippets, including the relevant cross-file code snippet (the gold snippet), and concatenate them as input, requiring the model to use the relevant cross-file code to complete the code in the current file.
\ No newline at end of file
diff --git a/task_en.md b/task_en.md
deleted file mode 100644
index 186164f..0000000
--- a/task_en.md
+++ /dev/null
@@ -1,67 +0,0 @@
-# Task statistics
-
-| Task | Task Type | Eval metric | Avg len |Language | \#Sample |
-| --------- | -------------| ------------- |--------- | ------------- |--------- |
-| HotpotQA | Multi-doc QA | F1 |9149 |EN |200 |
-| 2WikiMultihopQA| Multi-doc QA | F1 |4885 |EN |200 |
-| Musique| Multi-doc QA | F1 |7798 |EN |200 |
-| DuReader| Multi-doc QA | Rouge-L |15768 |ZH |200 |
-| MultiFieldQA-en| Single-doc QA | F1 |4559 |EN |150 |
-| MultiFieldQA-zh| Single-doc QA | F1 |6771 |ZH |200 |
-| NarrativeQA| Single-doc QA | F1 |18405 |EN |200 |
-| Qasper| Single-doc QA | F1 |3619 |EN |200 |
-| GovReport| Summarization | Rouge-L |8169 |EN |200 |
-| QMSum| Summarization | Rouge-L |10546 |EN |200 |
-| VCSUM| Summarization | Rouge-L |15147 |ZH |200 |
-| TriviaQA| Few shot | F1 |8015 |EN |200 |
-| NQ| Few shot | F1 |8210 |EN |200 |
-| TREC| Few shot | Accuracy |5176 |EN |200 |
-| LSHT| Few shot | Accuracy |22333 |ZH |200 |
-| PassageRetrieval-en| Synthetic | Accuracy |9288 |EN |200 |
-| PassageCount| Synthetic | Accuracy |11141 |EN |200 |
-| PassageRetrieval-zh | Synthetic | Accuracy |6745 |ZH |200 |
-| LCC| Code | Edit Sim |1235 |Python/C#/Java |500 |
-| RepoBench-P| Code | Edit Sim |5622 |Python/Java |500 |
-
-> Note: In order to avoid discrepancies caused by different tokenizers, we use the word count (using Python's split function) to calculate the average length of English datasets and code datasets, and use the character count to calculate the average length of Chinese datasets.
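To make this counting rule concrete, here is a minimal sketch; `text_length` is a hypothetical helper (not part of the repository) and it assumes the `language` field uses a "zh" code for Chinese samples:

```python
def text_length(text: str, language: str) -> int:
    """Word count (whitespace split) for English/code text, character count for Chinese."""
    return len(text) if language == "zh" else len(text.split())

# Each LongBench sample also stores its length in the "length" field, so the
# average length of a loaded subset can be computed directly:
# avg_length = sum(sample["length"] for sample in data) / len(data)
```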
- -# Task description - -| Task | Task Description | -| ----------------- | ------------------------------------------------------------ | -| HotpotQA | Answer related questions based on multiple given documents | -| 2WikiMultihopQA | Answer related questions based on multiple given documents | -| Musique | Answer related questions based on multiple given documents | -| DuReader | Answer related Chinese questions based on multiple retrieved documents | -| MultiFieldQA-en | Answer English questions based on a single document, which comes from a relatively diverse field | -| MultiFieldQA-zh | Answer Chinese questions based on a single document, which comes from a relatively diverse field | -| NarrativeQA | Ask questions based on stories or scripts, including understanding of important elements such as characters, plots, themes, etc. | -| Qasper | Ask questions based on a single paper, questions proposed by NLP readers, and answered by NLP practitioners | -| GovReport | A summarization task that requires summarizing government work reports | -| QMSum | A summarization task that requires summarizing meeting records based on user queries | -| VCSUM | A summarization task that requires summarizing Chinese meeting records | -| TriviaQA | Single document question answering task, providing several Few Shot examples | -| NQ | Single document question answering task, providing several Few Shot examples | -| TREC | A classification task that requires categorizing questions, includes 50 categories in total | -| LSHT | A Chinese classification task that requires categorizing news, includes 24 categories in total | -| PassageRetrieval-en | Given 30 English Wikipedia paragraphs, determine which paragraph the given summary belongs to | -| PassageCount | Determine the number of non-repeating paragraphs in a given number of paragraphs | -| PassageRetrieval-zh | Given several Chinese paragraphs from the C4 data set, determine which paragraph the given abstract belongs to | -| LCC | Given a longer piece of code, predict the next line of code | -| RepoBench-P | Given code in multiple files within a GitHub repository (including inter-file dependencies), predict the next line of code | - - -# Task construction - -> Note: For all tasks constructed from existing datasets, we use data from the validation or test set of the existing dataset (except for VCSUM). - -- The tasks of [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) are built based on the original datasets and processed to make them suitable for long text evaluation. Specifically, for questions in the validation set, we select the evidence passage that contains the answer and several distracting articles. These articles together with the original question constitute the input of the related tasks. -- The tasks of MultiFiedQA-zh and MultiFieldQA-en consist of long-text data from about 10 sources, including Latex papers, judicial documents, government work reports, and PDF documents indexed by Google. For each long text, we invite several PhD and master students to annotate, i.e., to ask questions based on the long text and give the correct answers. To better automate evaluation, we require the annotators to ask questions with definitive answers as much as possible. 
-- The tasks of [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), and [QMSum](https://arxiv.org/pdf/2104.05938.pdf) directly use the data provided by the original papers. During construction, we use the templates provided by [ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/) to convert the corresponding data into plain-text inputs.
-- The [VCSUM](https://arxiv.org/abs/2305.05280) task is built on the original dataset, and we designed a corresponding template to convert the data into plain-text inputs.
-- The tasks of [TriviaQA](https://nlp.cs.washington.edu/triviaqa/) and [NQ](https://ai.google.com/research/NaturalQuestions/) are constructed in the manner of [CoLT5](https://arxiv.org/abs/2303.09752): several examples of document-based question answering are provided, and the language model is required to answer related questions based on a new document.
-- The tasks of [TREC](https://aclanthology.org/C02-1150.pdf) and [LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf) are built on the original datasets. For each question in the validation set, we sample several examples from the training set to form few-shot demonstrations; these demonstrations, together with the validation question, constitute the input for the task (see the few-shot sketch after this list).
-- The PassageRetrieval-en task is constructed from English Wikipedia. For each piece of data, we randomly sample 30 paragraphs from English Wikipedia and select one for summarization (using GPT-3.5-Turbo). The task requires the model to identify which original paragraph the summary corresponds to.
-- The PassageCount task is constructed from English Wikipedia. For each piece of data, we randomly sample several passages from English Wikipedia, repeat each paragraph a random number of times, and finally shuffle the paragraphs. The task requires the model to determine how many unique (non-repeated) paragraphs there are among the given paragraphs (see the construction sketch after this list).
-- The PassageRetrieval-zh task is constructed from [C4](https://arxiv.org/abs/1910.10683). For each piece of data, we randomly sample several Chinese paragraphs from C4 and select one of them for summarization (using GPT-3.5-Turbo). The task requires the model to identify which original paragraph the summary corresponds to.
-- For the [LCC](https://arxiv.org/abs/2306.14893) task, we sample from the original code completion dataset. For the [RepoBench-P](https://arxiv.org/abs/2306.03091) task, we select the most challenging XF-F (Cross-File-First) setting from the original dataset and follow its Oracle-Filled scenario. For each original piece of data, we randomly extract multiple cross-file code snippets, including the effective cross-file code snippet (gold snippet), and concatenate them as the input, requiring the model to use the effective cross-file code to complete the code in the current file (see the sketch after this list).
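To make the PassageCount construction above concrete, here is a rough sketch. The paragraph pool, sampling ranges, prompt formatting, and function names are assumptions for illustration rather than the benchmark's actual construction code.

```python
import random


def build_passagecount_example(paragraph_pool, n_unique=(5, 10), max_repeats=4, seed=None):
    """Sketch: sample paragraphs, duplicate each a random number of times,
    shuffle, and keep the number of unique paragraphs as the answer."""
    rng = random.Random(seed)
    k = rng.randint(*n_unique)
    unique_paragraphs = rng.sample(paragraph_pool, k)
    repeated = []
    for p in unique_paragraphs:
        repeated.extend([p] * rng.randint(1, max_repeats))
    rng.shuffle(repeated)
    context = "\n\n".join(f"Paragraph {i + 1}: {p}" for i, p in enumerate(repeated))
    return {"context": context, "answer": k}  # the model must recover k


# Toy usage with placeholder "paragraphs"
pool = [f"This is placeholder Wikipedia paragraph number {i}." for i in range(50)]
example = build_passagecount_example(pool, seed=0)
print(example["answer"])
```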
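The TREC/LSHT few-shot construction described above can likewise be sketched as follows; the record fields (`question`, `label`) and the prompt wording are illustrative assumptions only.

```python
import random


def build_fewshot_input(train_set, eval_question, n_shots=8, seed=0):
    """Sketch: a handful of labelled training examples followed by the
    unlabelled validation question, concatenated into one prompt."""
    rng = random.Random(seed)
    shots = rng.sample(train_set, min(n_shots, len(train_set)))
    lines = [f"Question: {ex['question']}\nType: {ex['label']}" for ex in shots]
    lines.append(f"Question: {eval_question}\nType:")
    return "\n\n".join(lines)


train_set = [
    {"question": "What is the capital of France?", "label": "LOC"},
    {"question": "Who wrote Hamlet?", "label": "HUM"},
    {"question": "When did World War II end?", "label": "NUM"},
]
print(build_fewshot_input(train_set, "Where is the Great Barrier Reef?", n_shots=2))
```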
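Finally, a rough sketch of assembling an Oracle-Filled, RepoBench-P-style input as described above: the gold cross-file snippet is mixed in among distractor snippets and concatenated before the in-file context. The snippet fields and comment formatting are assumptions for illustration, not the dataset's actual layout.

```python
import random


def build_repobench_input(gold_snippet, distractors, in_file_context, n_distractors=3, seed=0):
    """Sketch: shuffle the gold cross-file snippet in among distractor snippets,
    then append the current file's code; the model predicts its next line."""
    rng = random.Random(seed)
    snippets = rng.sample(distractors, min(n_distractors, len(distractors)))
    snippets.append(gold_snippet)
    rng.shuffle(snippets)
    cross_file = "\n\n".join(f"# Snippet from {s['path']}\n{s['code']}" for s in snippets)
    return cross_file + "\n\n# Current file\n" + in_file_context


gold = {"path": "utils/math_ops.py", "code": "def add(a, b):\n    return a + b"}
distractors = [{"path": f"pkg/module_{i}.py", "code": f"VALUE_{i} = {i}"} for i in range(5)]
print(build_repobench_input(gold, distractors, "from utils.math_ops import add\nresult ="))
```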
\ No newline at end of file
diff --git a/task_zh.md b/task_zh.md
new file mode 100644
index 0000000..67a6a0d
--- /dev/null
+++ b/task_zh.md
@@ -0,0 +1,66 @@
+# Task statistics
+
+| Task | Task Type | Eval metric | Avg len | Language | \#Sample |
+| :--------- | :-----------: | :-----------: | :---------: | :-------------: | :---------: |
+| HotpotQA | Multi-doc QA | F1 | 9,149 | EN | 200 |
+| 2WikiMultihopQA | Multi-doc QA | F1 | 4,885 | EN | 200 |
+| Musique | Multi-doc QA | F1 | 7,798 | EN | 200 |
+| DuReader | Multi-doc QA | Rouge-L | 15,768 | ZH | 200 |
+| MultiFieldQA-en | Single-doc QA | F1 | 4,559 | EN | 150 |
+| MultiFieldQA-zh | Single-doc QA | F1 | 6,771 | ZH | 200 |
+| NarrativeQA | Single-doc QA | F1 | 18,405 | EN | 200 |
+| Qasper | Single-doc QA | F1 | 3,619 | EN | 200 |
+| GovReport | Summarization | Rouge-L | 8,169 | EN | 200 |
+| QMSum | Summarization | Rouge-L | 10,546 | EN | 200 |
+| VCSUM | Summarization | Rouge-L | 15,147 | ZH | 200 |
+| TriviaQA | Few-shot | F1 | 8,015 | EN | 200 |
+| NQ | Few-shot | F1 | 8,210 | EN | 200 |
+| TREC | Few-shot | Accuracy | 5,176 | EN | 200 |
+| LSHT | Few-shot | Accuracy | 22,333 | ZH | 200 |
+| PassageRetrieval-en | Synthetic | Accuracy | 9,288 | EN | 200 |
+| PassageCount | Synthetic | Accuracy | 11,141 | EN | 200 |
+| PassageRetrieval-zh | Synthetic | Accuracy | 6,745 | ZH | 200 |
+| LCC | Code | Edit Sim | 1,235 | Python/C#/Java | 500 |
+| RepoBench-P | Code | Edit Sim | 5,622 | Python/Java | 500 |
+
+> Note: To avoid discrepancies caused by different tokenizers, we use the word count (Python's split function) to calculate the average length of the English and code datasets, and the character count to calculate the average length of the Chinese datasets.
+
+# Task description
+
+| Task | Task Description |
+| :----------------- | :----------------------------------------------------------- |
+| HotpotQA | Answer related questions based on multiple given documents |
+| 2WikiMultihopQA | Answer related questions based on multiple given documents |
+| Musique | Answer related questions based on multiple given documents |
+| DuReader | Answer related Chinese questions based on multiple retrieved documents |
+| MultiFieldQA-en | Answer English questions based on a single document drawn from a relatively diverse set of fields |
+| MultiFieldQA-zh | Answer Chinese questions based on a single document drawn from a relatively diverse set of fields |
+| NarrativeQA | Answer questions about stories or scripts, covering important elements such as characters, plots, and themes |
+| Qasper | Answer questions over a single paper; questions are posed by NLP readers and answered by NLP practitioners |
+| GovReport | A summarization task that requires summarizing government work reports |
+| QMSum | A summarization task that requires summarizing meeting records based on a user query |
+| VCSUM | A summarization task that requires summarizing Chinese meeting records |
+| TriviaQA | Single-document question answering task, with several few-shot examples provided |
+| NQ | Single-document question answering task, with several few-shot examples provided |
+| TREC | A classification task that requires categorizing questions into 50 categories in total |
+| LSHT | A Chinese classification task that requires categorizing news into 24 categories in total |
+| PassageRetrieval-en | Given 30 English Wikipedia paragraphs, determine which paragraph the given summary belongs to |
+| PassageCount | Determine the number of non-repeated paragraphs among a given set of paragraphs |
+| PassageRetrieval-zh | Given several Chinese paragraphs from the C4 dataset, determine which paragraph the given summary belongs to |
+| LCC | Given a long piece of code, predict the next line of code |
+| RepoBench-P | Given code from multiple files within a GitHub repository (including inter-file dependencies), predict the next line of code |
+
+# Task construction
+
+> Note: For all tasks constructed from existing datasets, we use data from the validation or test set of the existing dataset (except for VCSUM).
+
+- The tasks of [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) are built on the original datasets and processed to make them suitable for long-text evaluation. Specifically, for questions in the validation set, we select the evidence passage that contains the answer and several distracting articles; these articles, together with the original question, constitute the input of the related tasks.
+- The tasks of MultiFieldQA-zh and MultiFieldQA-en consist of long-text data from about 10 sources, including LaTeX papers, judicial documents, government work reports, and PDF documents indexed by Google. For each long text, we invite several PhD and master's students to annotate it, i.e., to ask questions based on the long text and give the correct answers. To better automate evaluation, we require the annotators to ask questions with definitive answers as much as possible.
+- The tasks of [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), and [QMSum](https://arxiv.org/pdf/2104.05938.pdf) directly use the data provided by the original papers. During construction, we use the templates provided by [ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/) to convert the corresponding data into plain-text inputs.
+- The [VCSUM](https://arxiv.org/abs/2305.05280) task is built on the original dataset, and we designed a corresponding template to convert the data into plain-text inputs.
+- The tasks of [TriviaQA](https://nlp.cs.washington.edu/triviaqa/) and [NQ](https://ai.google.com/research/NaturalQuestions/) are constructed in the manner of [CoLT5](https://arxiv.org/abs/2303.09752): several examples of document-based question answering are provided, and the language model is required to answer related questions based on a new document.
+- The tasks of [TREC](https://aclanthology.org/C02-1150.pdf) and [LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf) are built on the original datasets. For each question in the validation set, we sample several examples from the training set to form few-shot demonstrations; these demonstrations, together with the validation question, constitute the input for the task.
+- The PassageRetrieval-en task is constructed from English Wikipedia. For each piece of data, we randomly sample 30 paragraphs from English Wikipedia and select one for summarization (using GPT-3.5-Turbo). The task requires the model to identify which original paragraph the summary corresponds to.
+- The PassageCount task is constructed from English Wikipedia. For each piece of data, we randomly sample several passages from English Wikipedia, repeat each paragraph a random number of times, and finally shuffle the paragraphs. The task requires the model to determine how many unique (non-repeated) paragraphs there are among the given paragraphs.
+- The PassageRetrieval-zh task is constructed from [C4](https://arxiv.org/abs/1910.10683). For each piece of data, we randomly sample several Chinese paragraphs from C4 and select one of them for summarization (using GPT-3.5-Turbo). The task requires the model to identify which original paragraph the summary corresponds to.
+- For the [LCC](https://arxiv.org/abs/2306.14893) task, we sample from the original code completion dataset. For the [RepoBench-P](https://arxiv.org/abs/2306.03091) task, we select the most challenging XF-F (Cross-File-First) setting from the original dataset and follow its Oracle-Filled scenario. For each original piece of data, we randomly extract multiple cross-file code snippets, including the effective cross-file code snippet (gold snippet), and concatenate them as the input, requiring the model to use the effective cross-file code to complete the code in the current file.
\ No newline at end of file
--
GitLab