| HotpotQA | Answer related questions based on multiple given documents |
| HotpotQA | Answer related questions based on multiple given documents |
| 2WikiMultihopQA | Answer related questions based on multiple given documents |
| 2WikiMultihopQA | Answer related questions based on multiple given documents |
| Musique | Answer related questions based on multiple given documents |
| MuSiQue | Answer related questions based on multiple given documents |
| DuReader | Answer related Chinese questions based on multiple retrieved documents |
| DuReader | Answer related Chinese questions based on multiple retrieved documents |
| MultiFieldQA-en | Answer English questions based on a long article, which comes from a relatively diverse field |
| MultiFieldQA-en | Answer English questions based on a long article, which comes from a relatively diverse field |
| MultiFieldQA-zh | Answer Chinese questions based on a long article, which comes from a relatively diverse field |
| MultiFieldQA-zh | Answer Chinese questions based on a long article, which comes from a relatively diverse field |
...
@@ -55,7 +55,7 @@
...
@@ -55,7 +55,7 @@
> Note: For all tasks constructed from existing datasets, we use data from the validation or test set of the existing dataset (except for VCSUM).
> Note: For all tasks constructed from existing datasets, we use data from the validation or test set of the existing dataset (except for VCSUM).
- The tasks of [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [Musique](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) are built based on the original datasets and processed to be suitable for long context evaluation. Specifically, for questions in the validation set, we select the evidence passage that contains the answer and several distracting articles. These articles together with the original question constitute the input of the tasks.
- The tasks of [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [MuSiQue](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) are built based on the original datasets and processed to be suitable for long context evaluation. Specifically, for questions in the validation set, we select the evidence passage that contains the answer and several distracting articles. These articles together with the original question constitute the input of the tasks.
- The tasks of MultiFiedQA-zh and MultiFieldQA-en consist of long artical data from about 10 sources, including Latex papers, judicial documents, government work reports, and PDF documents indexed by Google. For each long artical, we invite several PhD and master students to annotate, i.e., to ask questions based on the long artical and give the correct answers. To better automate evaluation, we ask the annotators to propose questions with definitive answers as much as possible.
- The tasks of MultiFiedQA-zh and MultiFieldQA-en consist of long artical data from about 10 sources, including Latex papers, judicial documents, government work reports, and PDF documents indexed by Google. For each long artical, we invite several PhD and master students to annotate, i.e., to ask questions based on the long artical and give the correct answers. To better automate evaluation, we ask the annotators to propose questions with definitive answers as much as possible.
- The tasks of [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), and [QMSum](https://arxiv.org/pdf/2104.05938.pdf) directly use the data provided by the original papers. In the specific construction, we use the template provided by [ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/) to convert the corresponding data into pure text input.
- The tasks of [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), and [QMSum](https://arxiv.org/pdf/2104.05938.pdf) directly use the data provided by the original papers. In the specific construction, we use the template provided by [ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/) to convert the corresponding data into pure text input.
- The [VCSUM](https://arxiv.org/abs/2305.05280) task is built based on the original dataset, and we design a corresponding template to convert the corresponding data into pure text input.
- The [VCSUM](https://arxiv.org/abs/2305.05280) task is built based on the original dataset, and we design a corresponding template to convert the corresponding data into pure text input.