diff --git a/task.md b/task.md
index 47bc8af428a599196baf42ef3a8dda6d71898f45..121cd2fa94552988333ae65d8a562883b8b76a03 100644
--- a/task.md
+++ b/task.md
@@ -4,7 +4,7 @@
 | :-------- | :-----------:| :-----------: |:-------: | :-----------: |:--------: |
 | HotpotQA | Multi-doc QA | F1 |9,149 |EN |200 |
 | 2WikiMultihopQA| Multi-doc QA | F1 |4,885 |EN |200 |
-| Musique| Multi-doc QA | F1 |7,798 |EN |200 |
+| MuSiQue| Multi-doc QA | F1 |11,018 |EN |200 |
 | DuReader| Multi-doc QA | Rouge-L |15,768 |ZH |200 |
 | MultiFieldQA-en| Single-doc QA | F1 |4,559 |EN |150 |
 | MultiFieldQA-zh| Single-doc QA | F1 |6,771 |ZH |200 |
@@ -31,7 +31,7 @@
 | :---------------- | :----------------------------------------------------------- |
 | HotpotQA | Answer related questions based on multiple given documents |
 | 2WikiMultihopQA | Answer related questions based on multiple given documents |
-| Musique | Answer related questions based on multiple given documents |
+| MuSiQue | Answer related questions based on multiple given documents |
 | DuReader | Answer related Chinese questions based on multiple retrieved documents |
 | MultiFieldQA-en | Answer English questions based on a long article, which comes from a relatively diverse field |
 | MultiFieldQA-zh | Answer Chinese questions based on a long article, which comes from a relatively diverse field |
@@ -55,7 +55,7 @@
 
 > Note: For all tasks constructed from existing datasets, we use data from the validation or test set of the existing dataset (except for VCSUM).
 
-- The tasks of [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) are built based on the original datasets and processed to be suitable for long context evaluation. Specifically, for questions in the validation set, we select the evidence passage that contains the answer and several distracting articles. These articles together with the original question constitute the input of the tasks.
+- The tasks of [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [MuSiQue](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) are built based on the original datasets and processed to be suitable for long context evaluation. Specifically, for questions in the validation set, we select the evidence passage that contains the answer and several distracting articles. These articles together with the original question constitute the input of the tasks.
 - The tasks of MultiFiedQA-zh and MultiFieldQA-en consist of long artical data from about 10 sources, including Latex papers, judicial documents, government work reports, and PDF documents indexed by Google. For each long artical, we invite several PhD and master students to annotate, i.e., to ask questions based on the long artical and give the correct answers. To better automate evaluation, we ask the annotators to propose questions with definitive answers as much as possible.
 - The tasks of [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), and [QMSum](https://arxiv.org/pdf/2104.05938.pdf) directly use the data provided by the original papers. In the specific construction, we use the template provided by [ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/) to convert the corresponding data into pure text input.
 - The [VCSUM](https://arxiv.org/abs/2305.05280) task is built based on the original dataset, and we design a corresponding template to convert the corresponding data into pure text input.
diff --git a/task_zh.md b/task_zh.md
index 67a6a0dfcb938bfbb39b8168261c9b0455ffbad9..af2a1ae15a1fb0633fde314976db3215289c09ee 100644
--- a/task_zh.md
+++ b/task_zh.md
@@ -4,7 +4,7 @@
 | :--------- | :-----------:| :-----------: |:---------: | :-------------: |:---------: |
 | HotpotQA | Multi-doc QA | F1 |9,149 |English |200 |
 | 2WikiMultihopQA| Multi-doc QA | F1 |4,885 |English |200 |
-| Musique| Multi-doc QA | F1 |7,798 |English |200 |
+| MuSiQue| Multi-doc QA | F1 |11,018 |English |200 |
 | DuReader| Multi-doc QA | Rouge-L |15,768 |Chinese |200 |
 | MultiFieldQA-en| Single-doc QA | F1 |4,559 |English |150 |
 | MultiFieldQA-zh| Single-doc QA | F1 |6,771 |Chinese |200 |
@@ -31,7 +31,7 @@
 | :----------------- | :----------------------------------------------------------- |
 | HotpotQA | Answer related questions based on multiple given documents |
 | 2WikiMultihopQA | Answer related questions based on multiple given documents |
-| Musique | Answer related questions based on multiple given documents |
+| MuSiQue | Answer related questions based on multiple given documents |
 | DuReader | Answer related Chinese questions based on multiple given retrieved documents |
 | MultiFieldQA-en | Answer English questions based on a single document; the documents come from relatively diverse fields |
 | MultiFieldQA-zh | Answer Chinese questions based on a single document; the documents come from relatively diverse fields |
@@ -54,7 +54,7 @@
 
 > Note: For all tasks constructed from existing datasets, we use data from the validation or test set of the original dataset (except for the VCSUM task).
 
-- The [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) tasks are built on the original datasets and processed to make them suitable for long-context evaluation. Specifically, for each question in the validation set, we select the evidence passage that contains the answer together with several distractor articles; these articles and the original question jointly form the input of the task.
+- The [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [MuSiQue](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) tasks are built on the original datasets and processed to make them suitable for long-context evaluation. Specifically, for each question in the validation set, we select the evidence passage that contains the answer together with several distractor articles; these articles and the original question jointly form the input of the task.
 - The MultiFiedQA-zh and MultiFieldQA-en tasks consist of long-text data from about 10 sources, including LaTeX papers, court judgments, government work reports, and PDF documents indexed by Google. For each long text, we invited several PhD and master's students to annotate it, that is, to ask questions based on the long text and provide the correct answers. To better support automated evaluation, we asked the annotators to propose questions with definitive answers wherever possible.
 - The [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), and [QMSum](https://arxiv.org/pdf/2104.05938.pdf) tasks directly use the data provided by the original papers. During construction, we use the templates provided by [ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/) to convert the data into plain-text input.
 - The [VCSUM](https://arxiv.org/abs/2305.05280) task is built on the original dataset; we designed a corresponding template for this data to convert it into plain-text input.
@@ -62,5 +62,5 @@
 - The [TREC](https://aclanthology.org/C02-1150.pdf) and [LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf) tasks are built on the original datasets. For each question in the validation set, we sample several examples from the training set to form few-shot examples. These examples, together with the validation-set question, form the input of the task.
 - The PassageRetrieval-en task is constructed from English Wikipedia. For each data item, we randomly sample 30 paragraphs from English Wikipedia and select one of them to summarize (using GPT-3.5-Turbo). The task requires the model to indicate which original paragraph the summary corresponds to.
 - The PassageCount task is constructed from English Wikipedia. For each data item, we randomly sample several paragraphs from English Wikipedia, repeat each of them a random number of times, and finally shuffle the paragraphs. The task requires the model to determine how many distinct paragraphs there are among the given paragraphs.
-- The PassageRetrieval-zh task is constructed from [C4](https://arxiv.org/abs/1910.10683). For each data item, we randomly sample several Chinese paragraphs from C4 and select one of them to summarize (using GPT-3.5-Turbo). The task requires the model to indicate which original paragraph the summary corresponds to.
+- The PassageRetrieval-zh task is constructed from [C4](https://arxiv.org/abs/1910.10683). For each data item, we randomly sample several Chinese paragraphs from C4 and select one of them to summarize (using GPT-3.5-Turbo). The task requires the model to give the name of the original paragraph that the summary corresponds to.
 - For the [LCC](https://arxiv.org/abs/2306.14893) task, we sample from the original code completion dataset. For the [RepoBench-P](https://arxiv.org/abs/2306.03091) task, we select the most challenging XF-F (Cross-File-First) setting of the original dataset and, following the Oracle-Filled scenario in the original paper, for each original data item we randomly extract multiple cross-file code snippets, including the effective cross-file code snippet (gold snippet), concatenate them as the input, and require the model to use the effective cross-file code among them to complete the code in the current file.
\ No newline at end of file