A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA
Kaiyang Wan, Lang Gao, Honglin Mu, Preslav Nakov, Yuxia Wang, Xiuying Chen
TL;DR
This work establishes a fundamental information-theoretic limit for single-pass LLM reasoning in multi-hop QA through a Fano-style accuracy bound, revealing an abrupt Accuracy Cliff when task information demand exceeds model capacity. It analyzes MHQA structure to identify Stepwise Capacity Overflow and Cross-Step Error Accumulation as the key failure modes, motivating a capacity-aware, multi-call approach. The authors introduce InfoQA, a multi-call framework with capacity-aware decomposition, dependency-explicit workflows, and iterative query contraction, and validate it on a rigorously constructed synthetic MHQA benchmark. Results show the predicted bound aligns with empirical single-pass performance, while InfoQA consistently improves accuracy, especially on longer contexts and deeper hops, demonstrating a practical route to scalable MHQA with LLMs.
Abstract
Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. This task is challenging for LLMs because they have a finite per-pass output capacity, beyond which the integration of task-relevant evidence becomes unreliable. Consequently, the single-pass reasoning paradigm is inherently vulnerable to this capacity overflow. To formalize this bottleneck, our analysis establishes a Fano-style accuracy upper bound, defining a theoretical performance ceiling for single-pass LLMs. This bound reveals that accuracy inevitably collapses once task complexity exceeds model capacity, providing general principles for capacity-aware representation and structuring of MHQA in LLMs. Building on these principles, we introduce InfoQA, a proof-of-concept multi-call framework for MHQA. It ensures high per-step accuracy by combining capacity-aware task decomposition with active pruning of prior reasoning traces, keeping the information load within the single-pass limit. It further achieves robustness through a dependency-explicit workflow that enables precise control over the reasoning path. We construct a stringent and noise-rich benchmark to validate our theory and framework. Experimental results show that model behavior aligns with our predicted capacity curves, while InfoQA achieves consistent performance improvements. We hope our work inspires further LLM multi-step reasoning methods. Code: https://github.com/KaiyangWan/InfoQA
