MARCH: Evaluating the Intersection of Ambiguity Interpretation and Multi-hop Inference
Jeonghyun Park, Ingeol Baek, Seunghyun Yoon, Haeun Jang, Aparna Garimella, Akriti Jain, Nedim Lipka, Hwanhee Lee
TL;DR
This work targets the intersection of ambiguity interpretation and multi-hop reasoning in question answering by introducing MARCH, a 2,209-example benchmark derived from MuSiQue with type-specific clarifications and evidence-grounded long answers validated by humans. It then proposes CLARION, a two-stage framework that explicitly separates ambiguity planning from evidence-driven acting, enabling per-interpretation retrieval and hop-consistent reasoning. Empirical results show that state-of-the-art models struggle on MARCH, with CLARION delivering substantial improvements and highlighting the importance of explicit disambiguation to mitigate path-dependent errors. The paper also analyzes dataset quality, annotation reliability, and ablation studies, offering insights into how different ambiguity types impact multi-hop reasoning and providing a foundation for future robust reasoning systems.
Abstract
Real-world multi-hop QA is naturally linked with ambiguity, where a single query can trigger multiple reasoning paths that require independent resolution. Since ambiguity can occur at any stage, models must navigate layered uncertainty throughout the entire reasoning chain. Despite its prevalence in real-world user queries, previous benchmarks have primarily focused on single-hop ambiguity, leaving the complex interaction between multi-step inference and layered ambiguity underexplored. In this paper, we introduce MARCH, a benchmark for their intersection, with 2,209 multi-hop ambiguous questions curated via multi-LLM verification and validated by human annotation with strong agreement. Our experiments reveal that even state-of-the-art models struggle with MARCH, confirming that combining ambiguity resolution with multi-step reasoning is a significant challenge. To address this, we propose CLARION, a two-stage agentic framework that explicitly decouples ambiguity planning from evidence-driven reasoning, significantly outperforms existing approaches, and paves the way for robust reasoning systems.
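The plan-then-act separation described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `plan`, `act`, `Interpretation`, and `Answer`, and the toy reading generator, retriever, and reasoner, are all hypothetical stand-ins chosen for illustration. The sketch only shows the control flow the abstract implies, namely that interpretations are enumerated first and that retrieval and reasoning then run separately for each one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Interpretation:
    clarified_question: str  # one disambiguated reading of the original query


@dataclass
class Answer:
    interpretation: str
    evidence: List[str]
    text: str


def plan(question: str, readings: Callable[[str], List[str]]) -> List[Interpretation]:
    """Stage 1 (planning): enumerate readings before any retrieval happens."""
    return [Interpretation(r) for r in readings(question)]


def act(
    interps: List[Interpretation],
    retrieve: Callable[[str], List[str]],
    reason: Callable[[str, List[str]], str],
) -> List[Answer]:
    """Stage 2 (acting): retrieve and reason independently per interpretation."""
    answers = []
    for interp in interps:
        evidence = retrieve(interp.clarified_question)  # per-interpretation retrieval
        text = reason(interp.clarified_question, evidence)
        answers.append(Answer(interp.clarified_question, evidence, text))
    return answers


# Toy stand-ins (assumptions, not part of the paper): in practice these
# would be an LLM planner, a retriever, and an LLM reasoner.
TOY_CORPUS: Dict[str, List[str]] = {
    "[reading A]": ["passage A1", "passage A2"],
    "[reading B]": ["passage B1"],
}


def toy_readings(question: str) -> List[str]:
    return [f"{question} [reading A]", f"{question} [reading B]"]


def toy_retrieve(query: str) -> List[str]:
    return [p for tag, passages in TOY_CORPUS.items() if tag in query for p in passages]


def toy_reason(query: str, evidence: List[str]) -> str:
    return f"answer grounded in {len(evidence)} passage(s)"


answers = act(plan("ambiguous multi-hop question", toy_readings), toy_retrieve, toy_reason)
for a in answers:
    print(a.interpretation, "->", a.text)
```

Keeping the evidence sets separate per interpretation is the point of the sketch: passages retrieved for one reading never feed the reasoning chain of another, which is one plausible way to limit the path-dependent errors mentioned in the TL;DR.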
