Who did What: A Large-Scale Person-Centered Cloze Dataset

Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, David McAllester

TL;DR

Who-did-What (WDW) is a large-scale cloze reading comprehension dataset built from pairs of articles in the Gigaword corpus. Each question is formed by removing a person named entity from text drawn from one article, and the answer must be found in a separate, related article that serves as the passage. Choices are not anonymized, which preserves realistic cues. The construction pipeline includes named-entity recognition, syntactic parsing, cross-article retrieval, and systematic suppression of problems that simple baselines solve, yielding over 200k problems with 84% human solvability. Empirical benchmarks show neural readers scoring below humans and below their own results on prior datasets, which suggests that WDW is a harder, more semantically demanding benchmark that can drive advances in deep understanding. The work provides a scalable resource and a methodological framework for evaluating and improving machine reading comprehension models.

Abstract

We have constructed a new "Who-did-What" dataset of over 200,000 fill-in-the-gap (cloze) multiple-choice reading comprehension problems built from the LDC English Gigaword newswire corpus. The WDW dataset has a variety of novel features. First, in contrast with the CNN and Daily Mail datasets (Hermann et al., 2015), we avoid using article summaries for question formation. Instead, each problem is formed from two independent articles --- an article given as the passage to be read and a separate article on the same events used to form the question. Second, we avoid anonymization --- each choice is a person named entity. Third, the problems have been filtered to remove a fraction that are easily solved by simple baselines, while remaining 84% solvable by humans. We report performance benchmarks of standard systems and propose the WDW dataset as a challenge task for the community.
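To make the problem format concrete, here is a minimal Python sketch of the two ideas the abstract describes: blanking a person name in question text taken from one article, and checking whether a simple baseline already solves the resulting problem. Everything in it is an assumption made for illustration, not the authors' pipeline: the names `ClozeProblem`, `make_problem`, `most_frequent_baseline`, and `survives_filter`, the `XXX` placeholder, and the choice of a most-frequent-name baseline are all hypothetical. Person names are passed in by hand here, where the real pipeline would obtain them from named-entity recognition.

```python
from dataclasses import dataclass
from typing import List, Optional

BLANK = "XXX"  # hypothetical placeholder token for the removed name


@dataclass
class ClozeProblem:
    passage: str        # article to be read
    question: str       # text from a second article, with one name blanked
    choices: List[str]  # non-anonymized person names
    answer: str


def make_problem(passage: str, question_text: str, answer: str,
                 persons_in_passage: List[str]) -> Optional[ClozeProblem]:
    """Blank `answer` in the question text; choices are persons in the passage.

    Returns None when the answer is missing from either article or when
    fewer than two distinct choices are available.
    """
    if answer not in question_text or answer not in persons_in_passage:
        return None
    choices = sorted(set(persons_in_passage))
    if len(choices) < 2:
        return None
    question = question_text.replace(answer, BLANK, 1)
    return ClozeProblem(passage, question, choices, answer)


def most_frequent_baseline(problem: ClozeProblem) -> str:
    """Assumed simple baseline: pick the choice mentioned most often in the passage."""
    return max(problem.choices, key=problem.passage.count)


def survives_filter(problem: ClozeProblem) -> bool:
    """Keep a problem only if the simple baseline gets it wrong."""
    return most_frequent_baseline(problem) != problem.answer


passage = ("Alice Smith met Bob Jones on Monday. Bob Jones later praised "
           "the deal. Bob Jones signed it.")
question_text = "Bob Jones signed the agreement after talks with Alice Smith."
persons = ["Alice Smith", "Bob Jones"]

problem = make_problem(passage, question_text, "Alice Smith", persons)
print(problem.question)
print(most_frequent_baseline(problem), survives_filter(problem))
```

In this toy case the baseline picks the most frequently mentioned name, which is not the answer, so the problem would be kept. The sketch drops every problem the baseline solves, whereas the abstract says only a fraction of such problems is removed.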

Paper Structure

This paper contains 6 sections, 2 equations, 4 tables, 2 algorithms.