Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning

Pradeep Dasigi, Nelson F. Liu, Ana Marasović, Noah A. Smith, Matt Gardner

TL;DR

Quoref introduces a large-scale benchmark of over 24K span-selection questions requiring coreference resolution across 4.7K Wikipedia paragraphs. The dataset is collected via an adversarial crowdsourcing loop to minimize surface cues, ensuring questions depend on genuine coreferential reasoning. Manual analysis shows that about 78% of questions require coreference, and state-of-the-art RC models reach at most 70.5 $F_1$ versus an estimated human 93.4 $F_1$, revealing a substantial gap. The work provides a targeted resource for evaluating coreference-aware reading comprehension and highlights directions for improving long-range entity tracking in neural models.

Abstract

Machine comprehension of texts longer than a single sentence often requires coreference resolution. However, most current reading comprehension benchmarks do not contain complex coreferential phenomena and hence fail to evaluate the ability of models to resolve coreference. We present a new crowdsourced dataset containing more than 24K span-selection questions that require resolving coreference among entities in over 4.7K English paragraphs from Wikipedia. Obtaining questions focused on such phenomena is challenging, because it is hard to avoid lexical cues that shortcut complex reasoning. We deal with this issue by using a strong baseline model as an adversary in the crowdsourcing loop, which helps crowdworkers avoid writing questions with exploitable surface cues. We show that state-of-the-art reading comprehension models perform significantly worse than humans on this benchmark: the best model performance is 70.5 $F_1$, while the estimated human performance is 93.4 $F_1$.
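
The abstract reports scores in $F_1$ without saying how the metric is computed. As a rough illustration only, the sketch below assumes the token-overlap $F_1$ commonly used for span-selection reading comprehension, where a predicted span gets partial credit for the tokens it shares with the gold span. The function name `token_f1` and the simple lowercase-and-split tokenization are assumptions made for this example and may differ from the paper's actual evaluation script.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted span and a gold span.

    Hypothetical helper: tokenization is plain lowercase whitespace
    splitting, which is an assumption, not the paper's exact procedure.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Two empty spans count as a match; exactly one empty span is a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection: each shared token is counted at most as often
    # as it occurs in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this assumed definition, a prediction that contains the gold span plus extra tokens keeps full recall but loses precision, so it scores below 1.0 rather than 0. A dataset-level score would then typically be the average of per-question values.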

Paper Structure

This paper contains 25 sections, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Example paragraph and questions from the dataset. Highlighted text in paragraphs is where the questions with matching highlights are anchored. Next to the questions are the relevant coreferent mentions from the paragraph. They are bolded for the first question, italicized for the second, and underlined for the third in the paragraph.