Dataset for the First Evaluation on Chinese Machine Reading Comprehension
Yiming Cui, Ting Liu, Zhipeng Chen, Wentao Ma, Shijin Wang, Guoping Hu
TL;DR
CMRC-2017 introduces a Chinese MRC dataset with two tracks, combining large-scale automatically generated training data and human-annotated validation/test sets. It formalizes the cloze task as the triple $\langle \mathcal{D}, \mathcal{Q}, \mathcal{A} \rangle$, with $\mathcal{A}$ a single word from the document, and includes a transfer-learning-focused user-query track. Baselines include Random Guess, Top Frequency, AS Reader, and AoA Reader, with AoA achieving strong results on the cloze track, while the user-query track exposes a domain-adaptation gap. Releasing the full dataset and evaluation framework aims to accelerate Chinese MRC research and enable fair cross-system comparisons.
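To make the cloze triple concrete, the following is a minimal Python sketch of one $\langle \mathcal{D}, \mathcal{Q}, \mathcal{A} \rangle$ example together with the two trivial baselines named above. It is an illustration under stated assumptions, not the evaluation's released code: the class and function names, the `XXXXX` blank marker, and the toy English tokens are all hypothetical, and the baseline definitions (a uniform pick over distinct document words, and the most frequent document word) are assumed readings of "Random Guess" and "Top Frequency".

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ClozeExample:
    document: List[str]  # D: tokenized document
    query: List[str]     # Q: tokenized query with one blank marker (hypothetical "XXXXX")
    answer: str          # A: a single word that appears in the document


def random_guess(example: ClozeExample, rng: random.Random) -> str:
    """Assumed baseline: pick uniformly among the distinct document words."""
    return rng.choice(sorted(set(example.document)))


def top_frequency(example: ClozeExample) -> str:
    """Assumed baseline: pick the most frequent word in the document."""
    return Counter(example.document).most_common(1)[0][0]


def accuracy(predict: Callable[[ClozeExample], str],
             examples: List[ClozeExample]) -> float:
    """Fraction of examples whose predicted word equals the gold answer."""
    correct = sum(predict(ex) == ex.answer for ex in examples)
    return correct / len(examples)


# Toy example (English tokens for readability; the real data is Chinese).
example = ClozeExample(
    document=["Tom", "has", "a", "dog", "Tom", "walks", "the", "dog",
              "Tom", "is", "happy"],
    query=["XXXXX", "walks", "the", "dog"],
    answer="Tom",
)

print(top_frequency(example))
print(accuracy(top_frequency, [example]))
```

Neither baseline reads the query, which is why they serve only as a floor against which neural readers such as AS Reader and AoA Reader are compared.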
Abstract
Machine Reading Comprehension (MRC) has recently attracted a great deal of attention. However, existing reading comprehension datasets are mostly in English. To add diversity to reading comprehension datasets, we propose a new Chinese reading comprehension dataset to accelerate related research in the community. The proposed dataset covers two task types, cloze-style reading comprehension and user-query reading comprehension, and is associated with large-scale training data as well as human-annotated validation and hidden test sets. Along with this dataset, we hosted the first Evaluation on Chinese Machine Reading Comprehension (CMRC-2017), which attracted tens of participants, suggesting the potential impact of this dataset.
