Compositional Questions Do Not Necessitate Multi-hop Reasoning

Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, Luke Zettlemoyer

TL;DR

This paper questions the assumption that compositional questions inherently require multi-hop reasoning by showing a strong single-hop baseline on HotpotQA. It presents a single-paragraph BERT-based QA approach that treats each paragraph independently and selects the final answer from the best paragraph, achieving competitive results in distractor settings and revealing retrieval as the main bottleneck in open-domain evaluation. Through detailed human analysis and experiments with adversarial and type-based distractors, the authors highlight how evidence quality and distractor design influence whether questions truly require multi-hop reasoning. They argue for evidence-centric evaluation and retrieval-focused research to advance multi-hop QA beyond mere compositionality, and they outline open challenges in open-domain settings and distractor construction.

Abstract

Multi-hop reading comprehension (RC) questions are challenging because they require reading and reasoning over multiple paragraphs. We argue that it can be difficult to construct large multi-hop RC datasets. For example, even highly compositional questions can be answered with a single hop if they target specific entity types, or the facts needed to answer them are redundant. Our analysis is centered on HotpotQA, where we show that single-hop reasoning can solve much more of the dataset than previously thought. We introduce a single-hop BERT-based RC model that achieves 67 F1, comparable to state-of-the-art multi-hop models. We also design an evaluation setting where humans are not shown all of the necessary paragraphs for the intended multi-hop reasoning but can still answer over 80% of questions. Together with detailed error analysis, these results suggest there should be an increasing focus on the role of evidence in multi-hop reasoning and possibly even a shift towards information retrieval style evaluations with large and diverse evidence collections.

Paper Structure

This paper contains 36 sections, 7 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: A HotpotQA example designed to require reasoning across two paragraphs. Eight spurious additional paragraphs (not shown) are provided to increase the task difficulty. However, since only one of the ten paragraphs is about an animal, one can immediately locate the answer in Paragraph 1 using one hop. The full example is provided in the appendix.
  • Figure 2: Our model, single-paragraph BERT, reads and scores each paragraph independently. The answer from the paragraph with the lowest $y_\mathrm{empty}$ score is chosen as the final answer.
  • Figure 3: Single-paragraph BERT reads and scores each paragraph independently. The answer from the paragraph with the lowest $y_\mathrm{empty}$ score is chosen as the final answer.
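
The selection rule described in the Figure 2 and Figure 3 captions can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `ParagraphPrediction`, `select_answer`, and the `read_paragraph` callable are hypothetical names introduced here, and the reader itself (a BERT model in the paper) is assumed to be any function that returns an answer string together with a $y_\mathrm{empty}$ score for a single paragraph.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ParagraphPrediction:
    """Output of reading one paragraph in isolation (hypothetical container)."""
    answer: str     # answer predicted from this paragraph alone
    y_empty: float  # score that the paragraph holds no answer; lower is better


def select_answer(
    question: str,
    paragraphs: Sequence[str],
    read_paragraph: Callable[[str, str], ParagraphPrediction],
) -> ParagraphPrediction:
    """Score every paragraph independently and keep the lowest y_empty."""
    if not paragraphs:
        raise ValueError("at least one paragraph is required")
    # Each call sees only one paragraph, so no cross-paragraph reasoning occurs.
    predictions = [read_paragraph(question, p) for p in paragraphs]
    return min(predictions, key=lambda pred: pred.y_empty)
```

Because `read_paragraph` never sees more than one paragraph at a time, any question this procedure answers correctly is, by construction, answerable with a single hop.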