MARCH: Evaluating the Intersection of Ambiguity Interpretation and Multi-hop Inference

Jeonghyun Park, Ingeol Baek, Seunghyun Yoon, Haeun Jang, Aparna Garimella, Akriti Jain, Nedim Lipka, Hwanhee Lee

TL;DR

This work targets the intersection of ambiguity interpretation and multi-hop reasoning in question answering by introducing MARCH, a 2,209-example benchmark derived from MuSiQue with type-specific clarifications and evidence-grounded long answers validated by humans. It then proposes CLARION, a two-stage framework that explicitly separates ambiguity planning from evidence-driven acting, enabling per-interpretation retrieval and hop-consistent reasoning. Empirical results show that state-of-the-art models struggle on MARCH, while CLARION delivers substantial improvements, highlighting the importance of explicit disambiguation for mitigating path-dependent errors. The paper also analyzes dataset quality and annotation reliability and reports ablation studies, offering insights into how different ambiguity types affect multi-hop reasoning and providing a foundation for future robust reasoning systems.

Abstract

Real-world multi-hop QA is naturally linked with ambiguity, where a single query can trigger multiple reasoning paths that require independent resolution. Since ambiguity can occur at any stage, models must navigate layered uncertainty throughout the entire reasoning chain. Despite its prevalence in real-world user queries, previous benchmarks have primarily focused on single-hop ambiguity, leaving the complex interaction between multi-step inference and layered ambiguity underexplored. In this paper, we introduce MARCH, a benchmark for their intersection, with 2,209 multi-hop ambiguous questions curated via multi-LLM verification and validated by human annotation with strong agreement. Our experiments reveal that even state-of-the-art models struggle with MARCH, confirming that combining ambiguity resolution with multi-step reasoning is a significant challenge. To address this, we propose CLARION, a two-stage agentic framework that explicitly decouples ambiguity planning from evidence-driven reasoning, significantly outperforms existing approaches, and paves the way for robust reasoning systems.
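
The plan-then-act separation described in the abstract can be illustrated with a toy sketch. This is not the CLARION implementation: the names `Interpretation`, `AMBIGUOUS_TERMS`, `EVIDENCE`, `plan`, `act`, and `answer` are all hypothetical, and the dictionary lookups stand in for what would presumably be LLM calls and a real retriever. The sketch only shows the control flow of enumerating interpretations first and then gathering evidence separately for each one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interpretation:
    """One disambiguated reading of the original question (hypothetical type)."""
    clarified_question: str
    entity: str


# Toy stand-in for an ambiguity planner (assumption: a real planner would use an LLM).
AMBIGUOUS_TERMS = {
    "Mustang": ["Ford Mustang (car)", "Fender Mustang (guitar)"],
}

# Toy evidence store keyed by resolved entity (assumption: stands in for retrieval).
EVIDENCE = {
    "Ford Mustang (car)": "The Ford Mustang is manufactured by Ford.",
    "Fender Mustang (guitar)": "The Fender Mustang is manufactured by Fender.",
}


def plan(question: str) -> list[Interpretation]:
    """Stage 1: enumerate every interpretation before any evidence is gathered."""
    interpretations = []
    for term, senses in AMBIGUOUS_TERMS.items():
        if term in question:
            for sense in senses:
                clarified = question.replace(term, sense)
                interpretations.append(Interpretation(clarified, sense))
    # An unambiguous question yields a single interpretation: itself.
    return interpretations or [Interpretation(question, "")]


def act(interpretation: Interpretation) -> str:
    """Stage 2: look up evidence for one interpretation only."""
    evidence = EVIDENCE.get(interpretation.entity, "no evidence found")
    return f"{interpretation.clarified_question} -> {evidence}"


def answer(question: str) -> list[str]:
    """Run the actor once per planned interpretation and keep every result."""
    return [act(interpretation) for interpretation in plan(question)]


if __name__ == "__main__":
    for line in answer("Who makes the Mustang?"):
        print(line)
```

Because each interpretation is resolved independently, the guitar reading is not discarded once the car reading has produced an answer, which is the kind of path-dependent error the TL;DR mentions.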

Paper Structure

This paper contains 53 sections, 2 equations, 17 figures, 15 tables.

Figures (17)

  • Figure 1: An example of multi-hop ambiguity QA. The ambiguity of the second hop ("pickup") is latent; it is only detectable if the alternative interpretation of the first hop ("Mustang" as guitar) is preserved.
  • Figure 2: Multi-hop ambiguity prevalence (top) and performance drops (bottom).
  • Figure 3: Overview of the four-stage MARCH dataset construction pipeline.
  • Figure 4: Overview of our CLARION framework. A Planning Agent resolves ambiguity, and an Acting Agent executes a ReAct loop to generate the final answer.
  • Figure 5: Correlation between LLM and human judgments.
  • ...and 12 more figures