
LEAD: Breaking the No-Recovery Bottleneck in Long-Horizon Reasoning

Denys Pushkin, Emmanuel Abbe

TL;DR

By incorporating short-horizon future validation and aggregating overlapping rollouts, LEAD provides enough isolation to maintain stability while retaining enough local context to correct errors.

Abstract

Long-horizon execution in Large Language Models (LLMs) remains unstable even when high-level strategies are provided. Evaluating on controlled algorithmic puzzles, we demonstrate that while decomposition is essential for stability, extreme decomposition creates a "no-recovery bottleneck". We show that this bottleneck becomes critical because of a highly non-uniform error distribution, in which consistent errors on a few "hard" steps become irreversible. To address this, we propose Lookahead-Enhanced Atomic Decomposition (LEAD). By incorporating short-horizon future validation and aggregating overlapping rollouts, LEAD provides enough isolation to maintain stability while retaining enough local context to correct errors. This enables the o4-mini model to solve Checkers Jumping up to complexity $n=13$, whereas extreme decomposition fails beyond $n=11$.
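
The claim that consistent errors on a few "hard" steps become irreversible can be illustrated with a toy calculation. The sketch below is not the paper's method or data: it assumes independent samples, a binary right/wrong outcome per step (all wrong samples agree, ties count as failure), and a made-up error profile. The names `majority_correct` and `chain_success` are hypothetical.

```python
from math import comb, prod

def majority_correct(p_err: float, k: int) -> float:
    """Probability that a strict majority of k independent samples is correct.

    Assumption: each sample is wrong with probability p_err, and all wrong
    samples agree on the same wrong answer, so a tie counts as a failure.
    """
    q = 1.0 - p_err
    return sum(comb(k, j) * q**j * p_err ** (k - j) for j in range(k // 2 + 1, k + 1))

def chain_success(step_errs, k: int) -> float:
    """Probability that every step of a no-recovery chain is voted correctly."""
    return prod(majority_correct(p, k) for p in step_errs)

# Hypothetical profile: many easy steps plus two hard ones (illustrative numbers only).
easy_steps = [0.01] * 200
hard_steps = [0.6] * 2

for k in (1, 31):
    print(k, chain_success(easy_steps, k), chain_success(easy_steps + hard_steps, k))
```

Under these assumptions, voting pushes the easy steps toward certainty but makes a step with error probability above 0.5 less likely to be voted correctly, so a handful of such steps dominates the overall failure rate.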

Paper Structure

This paper contains 31 sections, 1 equation, 11 figures, 2 tables, 1 algorithm.

Figures (11)

  • Figure 1: Inference-time curriculum prompting on Checkers Jumping (asking the model to solve a warm-up task $n=2$ and then the target task $n=N$ within a single response) underperforms direct prompting for $n=N$. For the o4-mini and GPT-5.2 models, high reasoning effort was used.
  • Figure 2: Comparison of methods across different problem domains
  • Figure 3: Histogram of estimated per-step error probabilities on Checkers Jumping ($n=15$). Bars indicate the number of steps with a given error probability. The y-axis is shown on a logarithmic scale.
  • Figure 4: Atomic Decomposition with voting approaches the atomic competence barrier on Checkers Jumping, where the limiting factor is atomic move execution. The plot compares the accuracy of full-step execution (selecting and executing a move) with that of isolated move selection and isolated move execution, as a function of problem size $n$. Results are shown for the o4-mini model using majority voting over 32 independently sampled solutions at each step.
  • Figure 5: Heatmap of pairwise comparisons between error distributions on Checkers Jumping ($n=13$) across models, measured using Total Variation (TV) distance. Higher values indicate more dissimilar error distributions. For DeepSeek, only a single estimate is available, so self-comparison is not reported.
  • ...and 6 more figures
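
The Total Variation (TV) distance named in the Figure 5 caption has a standard definition for discrete distributions: $\mathrm{TV}(P, Q) = \tfrac{1}{2} \sum_x |P(x) - Q(x)|$. The following is a minimal sketch of that formula only. How the paper builds its error distributions is not stated here, so the sketch assumes each one is a normalized count of errors per step index. The counts and the names `normalize` and `tv_distance` are hypothetical.

```python
def normalize(counts: dict) -> dict:
    """Turn non-negative error counts per step into a probability distribution."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("need at least one recorded error to normalize")
    return {step: c / total for step, c in counts.items()}

def tv_distance(p: dict, q: dict) -> float:
    """Total Variation distance between two discrete distributions.

    Keys missing from one distribution are treated as probability 0.
    The result lies in [0, 1]; 0 means identical, 1 means disjoint support.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# Hypothetical per-step error counts for two models (step index -> errors).
model_a = normalize({3: 8, 17: 1, 42: 1})
model_b = normalize({3: 2, 17: 6, 42: 2})
print(tv_distance(model_a, model_b))
```

A value near 0 would indicate that two models fail on the same steps in similar proportions, while a value near 1 would indicate that their errors concentrate on different steps.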