Commit d01ebf60 authored by bys0318

update MuSiQue statistic

parent 7325fa04
@@ -4,7 +4,7 @@
| :-------- | :-----------:| :-----------: |:-------: | :-----------: |:--------: |
| HotpotQA | Multi-doc QA | F1 |9,149 |EN |200 |
| 2WikiMultihopQA| Multi-doc QA | F1 |4,885 |EN |200 |
-| Musique| Multi-doc QA | F1 |7,798 |EN |200 |
+| MuSiQue| Multi-doc QA | F1 |11,018 |EN |200 |
| DuReader| Multi-doc QA | Rouge-L |15,768 |ZH |200 |
| MultiFieldQA-en| Single-doc QA | F1 |4,559 |EN |150 |
| MultiFieldQA-zh| Single-doc QA | F1 |6,771 |ZH |200 |
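
The F1 scores in the table above are token-level QA F1. A minimal sketch of how such a score can be computed follows; the whitespace tokenization and lowercasing are assumptions for illustration, not necessarily the benchmark's exact answer normalization.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer.

    Lowercasing and whitespace tokenization are illustrative assumptions;
    the benchmark's own scorer may normalize answers differently.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris", "paris , France"))  # partial overlap -> 0.5
```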
@@ -31,7 +31,7 @@
| :---------------- | :----------------------------------------------------------- |
| HotpotQA | Answer related questions based on multiple given documents |
| 2WikiMultihopQA | Answer related questions based on multiple given documents |
-| Musique | Answer related questions based on multiple given documents |
+| MuSiQue | Answer related questions based on multiple given documents |
| DuReader | Answer related Chinese questions based on multiple retrieved documents |
| MultiFieldQA-en | Answer English questions based on a long article drawn from relatively diverse fields |
| MultiFieldQA-zh | Answer Chinese questions based on a long article drawn from relatively diverse fields |
@@ -55,7 +55,7 @@
> Note: For all tasks constructed from existing datasets, we use data from the validation or test set of the existing dataset (except for VCSUM).
-- The tasks of [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) are built based on the original datasets and processed to be suitable for long context evaluation. Specifically, for questions in the validation set, we select the evidence passage that contains the answer and several distracting articles. These articles together with the original question constitute the input of the tasks.
+- The tasks of [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [MuSiQue](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) are built based on the original datasets and processed to be suitable for long context evaluation. Specifically, for questions in the validation set, we select the evidence passage that contains the answer and several distracting articles. These articles together with the original question constitute the input of the tasks.
- The tasks of MultiFieldQA-zh and MultiFieldQA-en consist of long-article data from about 10 sources, including LaTeX papers, judicial documents, government work reports, and PDF documents indexed by Google. For each long article, we invite several PhD and master's students to annotate it, i.e., to ask questions based on the long article and give the correct answers. To better automate evaluation, we ask the annotators to propose questions with definitive answers as much as possible.
- The tasks of [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), and [QMSum](https://arxiv.org/pdf/2104.05938.pdf) directly use the data provided by the original papers. In the specific construction, we use the template provided by [ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/) to convert the corresponding data into pure text input.
- The [VCSUM](https://arxiv.org/abs/2305.05280) task is built based on the original dataset, and we design a corresponding template to convert the corresponding data into pure text input.
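
To make the multi-doc QA construction in the first bullet above concrete, here is a minimal sketch of assembling one long-context input: gold evidence passages are mixed with distractor articles and concatenated ahead of the original question. The function name, argument layout, and prompt wording are illustrative assumptions, not the benchmark's actual template.

```python
import random

def build_multidoc_example(question: str,
                           evidence_passages: list[str],
                           distractor_passages: list[str],
                           seed: int = 0) -> str:
    """Concatenate gold evidence and distractor passages (shuffled),
    followed by the original question, into one long-context input."""
    rng = random.Random(seed)
    passages = evidence_passages + distractor_passages
    rng.shuffle(passages)
    context = "\n\n".join(
        f"Passage {i + 1}:\n{p}" for i, p in enumerate(passages)
    )
    # The prompt wording below is a placeholder, not the actual template.
    return f"{context}\n\nQuestion: {question}\nAnswer:"
```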
......
@@ -4,7 +4,7 @@
| :--------- | :-----------:| :-----------: |:---------: | :-------------: |:---------: |
| HotpotQA | Multi-doc QA | F1 |9,149 |English |200 |
| 2WikiMultihopQA| Multi-doc QA | F1 |4,885 |English |200 |
-| Musique| Multi-doc QA | F1 |7,798 |English |200 |
+| MuSiQue| Multi-doc QA | F1 |11,018 |English |200 |
| DuReader| Multi-doc QA | Rouge-L |15,768 |Chinese |200 |
| MultiFieldQA-en| Single-doc QA | F1 |4,559 |English |150 |
| MultiFieldQA-zh| Single-doc QA | F1 |6,771 |Chinese |200 |
@@ -31,7 +31,7 @@
| :----------------- | :----------------------------------------------------------- |
| HotpotQA | Answer related questions based on multiple given documents |
| 2WikiMultihopQA | Answer related questions based on multiple given documents |
-| Musique | Answer related questions based on multiple given documents |
+| MuSiQue | Answer related questions based on multiple given documents |
| DuReader | Answer related Chinese questions based on multiple retrieved documents |
| MultiFieldQA-en | Answer English questions based on a single document drawn from relatively diverse fields |
| MultiFieldQA-zh | Answer Chinese questions based on a single document drawn from relatively diverse fields |
@@ -54,7 +54,7 @@
> Note: For all tasks constructed from existing datasets, we use data from the validation or test set of the original dataset (except for the VCSUM task).
-- The [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) tasks are built from the original datasets and processed to make them suitable for long-text evaluation. Specifically, for questions in the validation set, we select the evidence passages containing the answer together with several distracting articles; these articles and the original question form the input of the task.
+- The [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [MuSiQue](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) tasks are built from the original datasets and processed to make them suitable for long-text evaluation. Specifically, for questions in the validation set, we select the evidence passages containing the answer together with several distracting articles; these articles and the original question form the input of the task.
- The MultiFieldQA-zh and MultiFieldQA-en tasks consist of long-text data from about 10 sources, including LaTeX papers, judicial documents, government work reports, and PDF documents indexed by Google. For each long text, we invited several PhD and master's students to annotate it, i.e., to ask questions based on the long text and give the correct answers. To better automate evaluation, we asked the annotators to propose questions with definitive answers whenever possible.
- The [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), and [QMSum](https://arxiv.org/pdf/2104.05938.pdf) tasks directly use the data provided by the original papers. In the specific construction, we use the templates provided by [ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/) to convert the corresponding data into pure-text input.
- The [VCSUM](https://arxiv.org/abs/2305.05280) task is built from the original dataset; we designed a corresponding template to convert the data into pure-text input.
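
The template-based conversion mentioned in the last two bullets amounts to rendering each raw record into a single text prompt. The sketch below shows the idea with a placeholder template and assumed field names; it is not the actual ZeroSCROLLS or VCSUM template.

```python
# Illustrative placeholder template; the real ZeroSCROLLS templates differ.
QUERY_SUMMARY_TEMPLATE = (
    "You are given a meeting transcript and a query.\n\n"
    "Transcript:\n{transcript}\n\n"
    "Query: {query}\n"
    "Answer:"
)

def render_example(record: dict) -> str:
    """Convert one raw record (assumed to carry 'transcript' and 'query'
    fields) into the pure-text input fed to the model."""
    return QUERY_SUMMARY_TEMPLATE.format(
        transcript=record["transcript"],
        query=record["query"],
    )
```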
@@ -62,5 +62,5 @@
- The [TREC](https://aclanthology.org/C02-1150.pdf) and [LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf) tasks are built from the original datasets. For each question in the validation set, we sample several items from the training set to form few-shot examples; these examples and the validation question together form the input of the task.
- The PassageRetrieval-en task is constructed from English Wikipedia. For each piece of data, we randomly sample 30 English Wikipedia paragraphs and summarize one of them (using GPT-3.5-Turbo). The task requires the model to identify which original paragraph the summary corresponds to.
- The PassageCount task is constructed from English Wikipedia. For each piece of data, we randomly sample several English Wikipedia paragraphs, randomly repeat each of them several times, and finally shuffle the paragraphs. The task requires the model to determine how many unique paragraphs there are among the given ones.
-- The PassageRetrieval-zh task is constructed from [C4](https://arxiv.org/abs/1910.10683). For each piece of data, we randomly sample several Chinese paragraphs from C4 and summarize one of them (using GPT-3.5-Turbo). The task requires the model to identify which original paragraph the summary corresponds to.
+- The PassageRetrieval-zh task is constructed from [C4](https://arxiv.org/abs/1910.10683). For each piece of data, we randomly sample several Chinese paragraphs from C4 and summarize one of them (using GPT-3.5-Turbo). The task requires the model to give the name of the original paragraph that the summary corresponds to.
- We sample and construct the [LCC](https://arxiv.org/abs/2306.14893) task from the original code completion dataset. For the [RepoBench-P](https://arxiv.org/abs/2306.03091) task, we select the most challenging XF-F (Cross-File-First) setting of the original dataset and, following the Oracle-Filled scenario in the original paper, randomly extract multiple cross-file code snippets, including the effective cross-file code snippet (gold snippet), for each piece of original data; these snippets are concatenated as the input, and the model is required to use the effective cross-file code among them to complete the code in the current file.
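
The PassageCount construction described in this list can be sketched as follows: sample unique paragraphs, repeat each a random number of times, shuffle, and keep the number of unique paragraphs as the reference answer. The sampling ranges and formatting below are assumptions for illustration, not the benchmark's exact parameters.

```python
import random

def build_passage_count_example(paragraph_pool: list[str],
                                num_unique: int = 10,
                                max_repeats: int = 3,
                                seed: int = 0) -> tuple[str, int]:
    """Build one PassageCount-style example.

    The input is a shuffled concatenation of duplicated paragraphs; the
    answer is the number of unique paragraphs among them.
    """
    rng = random.Random(seed)
    unique = rng.sample(paragraph_pool, num_unique)
    repeated: list[str] = []
    for p in unique:
        repeated.extend([p] * rng.randint(1, max_repeats))
    rng.shuffle(repeated)
    context = "\n\n".join(
        f"Paragraph {i + 1}: {p}" for i, p in enumerate(repeated)
    )
    return context, num_unique
```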
\ No newline at end of file