How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks

Divyansh Kaushik, Zachary C. Lipton

TL;DR

The paper critically evaluates popular reading comprehension (RC) benchmarks by systematically decoupling question and passage information, using question-only (Q-only) and passage-only (P-only) baselines and corrupted data. It demonstrates that several datasets, notably bAbI and CBT, can be solved largely with limited or even single-sentence context, challenging the assumption that they require deep, joint reasoning over question and passage. SQuAD and CNN are highlighted as more carefully designed benchmarks that resist easy Q-only or P-only shortcuts, underscoring the need for rigorous baselines and ablations. The authors advocate reporting both full-task and input-ablation performance and caution against dataset construction practices that obscure the true cognitive demands of RC tasks. Overall, the work calls for greater empirical rigor and transparency so that benchmark progress better reflects genuine reasoning capabilities.

Abstract

Many recent papers address reading comprehension, where examples consist of (question, passage, answer) tuples. Presumably, a model must combine information from both questions and passages to predict corresponding answers. However, despite intense interest in the topic, with hundreds of published papers vying for leaderboard dominance, basic questions about the difficulty of many popular benchmarks remain unanswered. In this paper, we establish sensible baselines for the bAbI, SQuAD, CBT, CNN, and Who-did-What datasets, finding that question- and passage-only models often perform surprisingly well. On $14$ out of $20$ bAbI tasks, passage-only models achieve greater than $50\%$ accuracy, sometimes matching the full model. Interestingly, while CBT provides $20$-sentence stories, only the last is needed for comparably accurate prediction. By comparison, SQuAD and CNN appear better-constructed.
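The input ablations described in the abstract can be illustrated with a small sketch. The names below (`RCExample`, `question_only`, `passage_only`, `last_sentence_only`, `ablation_report`) are hypothetical and do not come from the paper, and the passage is assumed to be already split into sentences. The sketch shows only how the ablated inputs are built and scored; a faithful reproduction would presumably train a separate model on each ablated variant instead of reusing one predictor.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RCExample:
    """One (question, passage, answer) tuple; the passage is a list of sentences."""
    question: str
    passage: List[str]
    answer: str


def question_only(ex: RCExample) -> RCExample:
    # Q-only ablation (assumed form): drop the passage entirely.
    return replace(ex, passage=[])


def passage_only(ex: RCExample) -> RCExample:
    # P-only ablation (assumed form): blank out the question.
    return replace(ex, question="")


def last_sentence_only(ex: RCExample) -> RCExample:
    # Keep only the final passage sentence, as in the CBT observation.
    return replace(ex, passage=ex.passage[-1:])


def accuracy(predict: Callable[[RCExample], str], examples: List[RCExample]) -> float:
    # Fraction of examples whose predicted answer matches the gold answer.
    if not examples:
        return 0.0
    return sum(predict(ex) == ex.answer for ex in examples) / len(examples)


def ablation_report(
    predict: Callable[[RCExample], str], examples: List[RCExample]
) -> Dict[str, float]:
    # Score the same predictor on the full input and on each ablated variant.
    return {
        "full": accuracy(predict, examples),
        "q_only": accuracy(predict, [question_only(ex) for ex in examples]),
        "p_only": accuracy(predict, [passage_only(ex) for ex in examples]),
        "last_sentence": accuracy(predict, [last_sentence_only(ex) for ex in examples]),
    }
```

Reporting all four numbers side by side makes it visible when a benchmark can be answered without the question, without the passage, or from a single sentence.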

Paper Structure

This paper contains 23 sections, 4 tables.