Limited Reasoning Space: The cage of long-horizon reasoning in LLMs

Zhenyu Li, Guanlin Wu, Cheems Wang, Yongqiang Zhao

TL;DR

Halo is a model predictive control framework for LLM planning on long-horizon tasks. It combines reason-based planning with an entropy-driven dual controller that follows a Measure-then-Plan strategy to achieve controllable reasoning.

Abstract

Test-time compute strategies such as Chain-of-Thought (CoT) have significantly enhanced the ability of large language models to solve complex tasks like logical reasoning. However, empirical studies indicate that simply increasing the compute budget can sometimes cause test-time performance to collapse under typical task decomposition strategies such as CoT. This work hypothesizes that reasoning failures at larger compute budgets stem from static planning methods, which can hardly perceive the intrinsic boundaries of LLM reasoning. We term this the Limited Reasoning Space hypothesis and analyze it theoretically through the lens of a non-autonomous stochastic dynamical system. The analysis suggests that there is an optimal range for compute budgets: over-planning can lead to redundant feedback and may even impair reasoning capabilities. To exploit the benefits of compute scaling while suppressing over-planning, this work proposes Halo, a model predictive control framework for LLM planning. Halo is designed for long-horizon tasks with reason-based planning and uses an entropy-driven dual controller that adopts a Measure-then-Plan strategy to achieve controllable reasoning. Experimental results demonstrate that Halo outperforms static baselines on complex long-horizon tasks by dynamically regulating planning at the reasoning boundary.

Paper Structure (42 sections, 1 theorem, 24 equations, 7 figures, 8 tables, 1 algorithm)

Key Result

Proposition 2.3

For a reasoning process with average Lyapunov exponent $\lambda > 0$ and sampling noise variance $\sigma^2$, the maximum effective length $N^*$ is bounded by:

Figures (7)

  • Figure 1: The Universality of the performance degradation across Model Scales. We evaluate seven state-of-the-art LLMs (from 7B to o1-mini) on the Omni-MATH benchmark. (Left) Width-Driven: As the number of parallel reasoning paths ($k$) increases in CoT-SC, performance initially peaks due to ensemble benefits but eventually degrades as noise dominates the majority vote. (Right) Depth-Driven: Similarly, in Tree-of-Thoughts (ToT), extending search depth ($d$) beyond the limited reasoning space leads to catastrophic error propagation.
  • Figure 2: The Perils of Over-Reasoning: A Case Study. While moderate reasoning yields the correct answer (Left), extending the chain beyond the limited reasoning space introduces noise (Step 4). This noise is amplified through recursive feedback, leading to hallucination (Right). Our framework, Halo (Bottom Center), detects such drift and resets the state to maintain logical stability.
  • Figure 3: The Halo Framework: Closing the Loop on Reasoning Dynamics. Top: Standard Chain-of-Thought operates in an open-loop regime, where stochastic errors accumulate unchecked until the trajectory suffers from severe degradation. Bottom: Halo introduces a Model Predictive Control (MPC) loop comprising three stages: (1) The Observer (Sec. \ref{subsec:entropy_bridge}) estimates the instantaneous drift rate $\hat{\lambda}_t$ via mean attention entropy $\mathcal{H}$; (2) The Controller (Sec. \ref{subsec:control_law}) tracks the accumulated uncertainty $\Omega_t$ against a Tolerance Threshold $\Psi$; (3) The Actuator (Sec. \ref{subsec:mechanisms}) executes a Trajectory Rectification via semantic compression and history reset, projecting the system back to stable space and resetting the uncertainty score $\Omega_t \leftarrow 0$.
  • Figure 4: Mechanistic Insights into Reasoning Stability on Omni-MATH. (a) Trajectory Projection: t-SNE visualization of hidden states $\mathbf{s}_t$. Standard CoT (Red) diverges into the high-entropy regime, while Halo (Blue) maintains logical anchoring via periodic trajectory rectification (marked by $\star$). (b) Phase Transition: Reasoning accuracy exhibits a sharp collapse as cumulative uncertainty breaches the threshold $\Psi$. Halo preemptively acts to avoid this regime.
  • Figure 5: Distributional Alignment of Reasoning Boundaries. We compare the step index of baseline failures (Red) and Halo interventions (Blue) across seven reasoning domains. The tight clustering of data points indicates that Halo consistently identifies the critical reasoning horizon with low variance.
  • ...and 2 more figures
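
The three-stage loop described in the Figure 3 caption (Observer, Controller, Actuator) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `HaloLoopSketch`, the `compress` callback, and in particular the choice to use the mean attention entropy directly as the drift estimate $\hat{\lambda}_t$ and to accumulate $\Omega_t$ by simple summation are assumptions made only for illustration. The paper's actual control law may differ.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


def mean_attention_entropy(attn_rows: Sequence[Sequence[float]]) -> float:
    """Mean Shannon entropy (in nats) over attention rows that each sum to 1."""
    entropies = [-sum(p * math.log(p) for p in row if p > 0.0) for row in attn_rows]
    return sum(entropies) / len(entropies)


@dataclass
class HaloLoopSketch:
    """Hypothetical Measure-then-Plan loop; all names here are illustrative."""
    psi: float                                   # tolerance threshold (Psi)
    omega: float = 0.0                           # accumulated uncertainty (Omega_t)
    history: List[str] = field(default_factory=list)
    resets: List[int] = field(default_factory=list)

    def step(
        self,
        t: int,
        thought: str,
        attn_rows: Sequence[Sequence[float]],
        compress: Callable[[List[str]], str],
    ) -> bool:
        """Process one reasoning step; return True if a rectification fired."""
        # Observer: ASSUMPTION, the drift estimate is the mean attention entropy itself.
        lam_hat = mean_attention_entropy(attn_rows)
        # Controller: ASSUMPTION, uncertainty accumulates additively.
        self.omega += lam_hat
        self.history.append(thought)
        if self.omega > self.psi:
            # Actuator: compress the history into one summary and reset Omega_t to 0.
            self.history = [compress(self.history)]
            self.omega = 0.0
            self.resets.append(t)
            return True
        return False


if __name__ == "__main__":
    loop = HaloLoopSketch(psi=3.0)
    uniform = [[0.25, 0.25, 0.25, 0.25]]         # entropy = ln 4 per step
    for t in range(6):
        fired = loop.step(t, f"step {t}", uniform, lambda h: " | ".join(h))
        print(t, round(loop.omega, 3), fired)
```

In this toy run each step adds ln 4 (about 1.386) to the accumulated uncertainty, so the threshold of 3.0 is crossed on every third step. A real system would replace the uniform attention rows with entropies measured from the model and `compress` with a semantic summarization call.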

Theorems & Definitions (1)

  • Proposition 2.3: Maximum Effective Reasoning Length