A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA

Kaiyang Wan, Lang Gao, Honglin Mu, Preslav Nakov, Yuxia Wang, Xiuying Chen

TL;DR

This work establishes a fundamental information-theoretic limit for single-pass LLM reasoning in multi-hop QA through a Fano-style accuracy bound, revealing an abrupt Accuracy Cliff when task information demand exceeds model capacity. It analyzes MHQA structure to identify Stepwise Capacity Overflow and Cross-Step Error Accumulation as the key failure modes, motivating a capacity-aware, multi-call approach. The authors introduce InfoQA, a multi-call framework with capacity-aware decomposition, dependency-explicit workflows, and iterative query contraction, and validate it on a rigorously constructed synthetic MHQA benchmark. Results show the predicted bound aligns with empirical single-pass performance, while InfoQA consistently improves accuracy, especially on longer contexts and deeper hops, demonstrating a practical route to scalable MHQA with LLMs.

Abstract

Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. This task is challenging for LLMs as they have a finite per-pass output capacity, beyond which the integration of task-relevant evidence proves unreliable. Consequently, the single-pass reasoning paradigm is inherently vulnerable to this capacity overflow. To formalize this bottleneck, our analysis establishes a Fano-style accuracy upper bound, defining a theoretical performance ceiling for single-pass LLMs. This bound reveals that accuracy inevitably collapses once task complexity exceeds model capacity, providing general principles for capacity-aware representation and structuring of MHQA in LLMs. Building on these principles, we introduce a proof-of-concept multi-call framework for MHQA, InfoQA. It ensures high per-step accuracy by combining capacity-aware task decomposition with active pruning of prior reasoning traces, keeping the information load within the single-pass limit. It further achieves robustness by a dependency-explicit workflow that enables precise control over the reasoning path. We construct a stringent and noise-rich benchmark to validate our theory and framework. Experimental results show that model behavior aligns with our predicted capacity curves while InfoQA achieves consistent performance improvements. We hope our work inspires more LLM multi-step reasoning methods. Code: InfoQA (https://github.com/KaiyangWan/InfoQA).

Paper Structure

This paper contains 49 sections, 1 theorem, 46 equations, 6 figures, 5 tables, 2 algorithms.

Key Result

Theorem 1

For any single-pass, closed-book policy, let $A \in \mathcal{A}$ be the ground-truth answer. Define the task’s information demand as $\beta \triangleq H(A \mid Q,C)$ and the model’s output capacity as $C \triangleq H(Y)$. The maximum achievable accuracy, $Acc = 1 - P_e$, is implicitly bounded by the inequality

$$h(Acc) + (1 - Acc)\log\left(|\mathcal{A}| - 1\right) \ge \beta - C,$$

where $h(\cdot)$ denotes the binary entropy function and $h(Acc) = h(1-P_e)$.
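Because the bound is implicit in $Acc$, it can help to solve it numerically. The following is a minimal sketch, not the authors' code. It assumes the standard Fano-style form $h(Acc) + (1 - Acc)\log(|\mathcal{A}| - 1) \ge \beta - C$, entropies measured in bits, and a finite answer set of size `num_answers` (at least 2). The function names are hypothetical. The left-hand side peaks at $Acc = 1/|\mathcal{A}|$ and decreases from there to zero at $Acc = 1$, so bisection on that interval finds the largest accuracy the inequality allows.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def fano_lhs(acc: float, num_answers: int) -> float:
    """Left-hand side h(Acc) + (1 - Acc) * log2(|A| - 1) of the assumed bound."""
    return binary_entropy(acc) + (1.0 - acc) * math.log2(num_answers - 1)


def max_accuracy(beta: float, capacity: float, num_answers: int,
                 tol: float = 1e-9) -> float:
    """Largest Acc in [0, 1] satisfying fano_lhs(Acc) >= beta - capacity.

    beta is the information demand and capacity the output capacity,
    both in bits (an assumption of this sketch).
    """
    if num_answers < 2:
        raise ValueError("num_answers must be at least 2")
    gap = beta - capacity
    if gap <= 0.0:
        return 1.0  # demand fits within capacity: the bound is vacuous
    chance = 1.0 / num_answers
    if gap >= math.log2(num_answers):
        return chance  # the bound cannot push accuracy below chance level
    # fano_lhs is decreasing on [chance, 1]; bisect for the crossing point.
    lo, hi = chance, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fano_lhs(mid, num_answers) >= gap:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    # Illustrative values only: capacity of 200 bits, 2**20 candidate answers.
    for beta in (190, 200, 201, 205, 210, 215):
        print(beta, round(max_accuracy(beta, 200, 2 ** 20), 4))
```

Sweeping `beta` past `capacity` in this way traces out the kind of ceiling curve the theorem describes: the bound stays at 1 while demand is below capacity and tightens once demand exceeds it.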

Figures (6)

  • Figure 1: Comparison of single-pass and multi-call reasoning paradigms. Single-pass reasoning is constrained by the limited output capacity of LLMs, making it difficult to solve long-context and multi-hop problems. Multi-call reasoning mitigates this by decomposing tasks into sequentially dependent sub-steps, ensuring high per-step accuracy and a reliable reasoning chain.
  • Figure 2: The Accuracy Cliff. The theoretical upper bound on accuracy is plotted against information demand $\beta$, using $C=200$ as an illustrative example. Once $\beta > C + 1$, the accuracy declines sharply.
  • Figure 3: Error Accumulation. Even a small per-step error rate ($\varepsilon$) causes a rapid decay in overall success probability as the number of hops ($K$) increases.
  • Figure 4: The InfoQA framework integrates three key components: (1) Capacity-Aware Task Decomposition, which reduces the information demand by generating single-hop sub-questions; (2) Dependency-Explicit Workflow, where the evolving contracted query carries the reasoning state across steps; and (3) Iterative Query Contraction, which prunes reasoning traces and rewrites the query with $\hat{Z}_k$. Each LLM call approximates $\phi_k$ and produces $\hat{Z}_k$.
  • Figure 5: Qwen3-14B F1 vs. theoretical curves across single-pass methods. The x-axis shows the estimated effective information demand ($\beta$), fitted per method, and the y-axis shows the F1 score.
  • ...and 1 more figure
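The error-accumulation effect described for Figure 3 can be illustrated with a short calculation. This is a sketch under a simplifying assumption that is not stated in the captions above: each of the $K$ hops fails independently with the same probability $\varepsilon$, and the chain succeeds only if every hop succeeds, giving an overall success probability of $(1 - \varepsilon)^K$. The function name is hypothetical.

```python
def chain_success_probability(eps: float, hops: int) -> float:
    """Probability that all `hops` steps succeed, assuming independent
    per-step errors with identical error rate `eps`."""
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must lie in [0, 1]")
    if hops < 0:
        raise ValueError("hops must be non-negative")
    return (1.0 - eps) ** hops


if __name__ == "__main__":
    # Illustrative per-step error rates, not values taken from the paper.
    for eps in (0.01, 0.05, 0.10):
        row = [round(chain_success_probability(eps, k), 3) for k in (1, 2, 4, 8, 16)]
        print(f"eps={eps}: {row}")
```

Under this independence assumption the decay is geometric in the number of hops, which is why keeping the per-step error rate low matters more as reasoning chains get deeper.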

Theorems & Definitions (1)

  • Theorem 1: A Fano-Style Accuracy Upper Bound for Single-Pass Reasoning