A Sentence Cloze Dataset for Chinese Machine Reading Comprehension

Yiming Cui, Ting Liu, Ziqing Yang, Zhipeng Chen, Wentao Ma, Wanxiang Che, Shijin Wang, Guoping Hu

TL;DR

This paper introduces Sentence Cloze-style Machine Reading Comprehension (SC-MRC) and the CMRC 2019 dataset to stress sentence-level inference in Chinese passages. It formalizes the task, describes data collection from Chinese narrative texts with both real and fake candidate sentences, and provides statistics to characterize the dataset. Baseline experiments with BERT- and RoBERTa-based models reveal that current pre-trained models lag behind human performance, particularly in maintaining coherence at the passage level, underscoring the dataset's challenge. The dataset and baselines are released to foster progress in sentence-level reasoning for Chinese MRC.

Abstract

Owing to the continuous efforts of the Chinese NLP community, more and more Chinese machine reading comprehension datasets have become available. To add diversity to this area, we propose a new task called Sentence Cloze-style Machine Reading Comprehension (SC-MRC). The task is to fill each of several blanks in a passage with the right candidate sentence. We built a Chinese dataset called CMRC 2019 to evaluate the difficulty of the SC-MRC task. To make the task harder, we also created fake candidates that are similar to the correct ones, which requires the machine to judge their correctness in context. The dataset contains over 100K blanks (questions) within over 10K passages, which originate from Chinese narrative stories. To evaluate the dataset, we implement several baseline systems based on pre-trained models, and the results show that the state-of-the-art model still underperforms humans by a large margin. We release the dataset and baseline system to support further research in the community. Resources are available at https://github.com/ymcui/cmrc2019
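To make the task format concrete, the following is a minimal sketch of how one SC-MRC instance and two accuracy measures (per blank and per passage) might be represented. The class name `ClozePassage`, its field names, the `[BLANKi]` placeholder format, and the toy English passage are all assumptions made for illustration; they are not taken from the abstract and may differ from the released data format and the official evaluation script.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClozePassage:
    """Hypothetical container for one SC-MRC instance (illustrative only)."""
    context: str        # passage text with [BLANK1], [BLANK2], ... placeholders
    choices: List[str]  # candidate sentences; may include fake candidates
    answers: List[int]  # answers[i] is the index in `choices` for blank i + 1


def fill_blanks(passage: ClozePassage, picks: Sequence[int]) -> str:
    """Return the passage text with each blank replaced by the picked candidate."""
    text = passage.context
    for blank_no, choice_idx in enumerate(picks, start=1):
        text = text.replace(f"[BLANK{blank_no}]", passage.choices[choice_idx])
    return text


def blank_accuracy(passages: Sequence[ClozePassage],
                   predictions: Sequence[Sequence[int]]) -> float:
    """Fraction of individual blanks that were filled correctly."""
    correct = 0
    total = 0
    for passage, picks in zip(passages, predictions):
        correct += sum(gold == pred for gold, pred in zip(passage.answers, picks))
        total += len(passage.answers)
    return correct / total


def passage_accuracy(passages: Sequence[ClozePassage],
                     predictions: Sequence[Sequence[int]]) -> float:
    """Fraction of passages in which every blank was filled correctly."""
    hits = sum(passage.answers == list(picks)
               for passage, picks in zip(passages, predictions))
    return hits / len(passages)


# Toy example (invented): two blanks, three candidates, candidate 2 is fake.
example = ClozePassage(
    context="Tom woke up late. [BLANK1] He ran to the bus stop. [BLANK2]",
    choices=[
        "So he skipped breakfast.",
        "The bus had just left.",
        "So he cooked a big breakfast.",  # fake candidate, fits no blank
    ],
    answers=[0, 1],
)
```

In this sketch, a system that fills the first blank correctly but picks the fake candidate for the second gets half credit at the blank level and none at the passage level, which is one way to see why coherence across a whole passage is harder to achieve than isolated correct choices.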

Paper Structure

This paper contains 13 sections, 1 equation, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Examples from the proposed CMRC 2019 dataset. An underlined candidate is a fake candidate (it does not belong to any blank). For clarity, we also provide an English example.