LEAD: Breaking the No-Recovery Bottleneck in Long-Horizon Reasoning

Denys Pushkin; Emmanuel Abbe

TL;DR

By incorporating short-horizon future validation and aggregating overlapping rollouts, LEAD provides enough isolation to maintain stability while retaining enough local context to correct errors.

Abstract

Long-horizon execution in Large Language Models (LLMs) remains unstable even when high-level strategies are provided. Evaluating on controlled algorithmic puzzles, we demonstrate that while decomposition is essential for stability, extreme decomposition creates a "no-recovery bottleneck". We show that this bottleneck becomes critical because errors are distributed highly non-uniformly across steps, so that consistent errors on a few "hard" steps become irreversible. To address this, we propose Lookahead-Enhanced Atomic Decomposition (LEAD). By incorporating short-horizon future validation and aggregating overlapping rollouts, LEAD provides enough isolation to maintain stability while retaining enough local context to correct errors. This enables the o4-mini model to solve Checkers Jumping up to complexity $n=13$, whereas extreme decomposition fails beyond $n=11$.
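The effect of a few "hard" steps can be illustrated with a small calculation. The sketch below is not the paper's method. It assumes a simplified model in which each step is resolved by majority vote over k independent samples, each sample is wrong with probability p, and all wrong samples agree with one another (a pessimistic assumption). The function names and the example error rates are hypothetical.

```python
from math import comb


def majority_vote_error(p: float, k: int) -> float:
    """Probability that a majority vote over k samples picks a wrong answer.

    Assumptions (illustrative only): samples are independent, each is wrong
    with probability p, and all wrong samples agree. Ties count as errors.
    """
    needed = k // 2 + 1  # correct votes required for a strict majority
    p_correct = sum(
        comb(k, c) * (1 - p) ** c * p ** (k - c) for c in range(needed, k + 1)
    )
    return 1.0 - p_correct


def chain_success(step_errors: list[float], k: int) -> float:
    """Probability that every step of a no-recovery chain is voted correctly."""
    prob = 1.0
    for p in step_errors:
        prob *= 1.0 - majority_vote_error(p, k)
    return prob


if __name__ == "__main__":
    easy_only = [0.1] * 100          # hypothetical: 100 easy steps
    with_hard = easy_only + [0.7]    # same chain plus one hard step
    print("easy only:", chain_success(easy_only, k=32))
    print("one hard step:", chain_success(with_hard, k=32))
```

Under these assumptions, voting drives the error on easy steps (p < 0.5) toward zero, but amplifies the error on a step with p > 0.5, so a single hard step dominates the success probability of the whole chain when no later step can correct it.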

Paper Structure (31 sections, 1 equation, 11 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Long-Horizon Reasoning Failures in LLMs
Context Management and Decomposition in LLM Reasoning
Solving Algorithmic Puzzles
Tasks
Methods
Baseline Execution Strategies
Single-shot generation.
Iterative restart generation.
Atomic decomposition.
Limitations of Atomic Execution
Lookahead Mechanism
Lookahead-Enhanced Atomic Decomposition (LEAD)
Experiments
...and 16 more sections

Figures (11)

Figure 1: Inference-time curriculum prompting on Checkers Jumping (asking the model to solve warm-up $n=2$ and then target task $n=N$ within a single response) underperforms direct prompting for $n=N$. For o4-mini and GPT-5.2 models, high reasoning effort was used.
Figure 2: Comparison of methods across different problem domains
Figure 3: Histogram of estimated per-step error probabilities on Checkers Jumping ($n=15$). Bars indicate the number of steps with a given error probability. The y-axis is shown on a logarithmic scale.
Figure 4: Atomic Decomposition with voting approaches the atomic competence barrier on Checkers Jumping, where the limiting factor is atomic move execution. The plot compares accuracy of full-step execution (selecting and executing a move) and isolated move selection or move execution as a function of problem size $n$. Results are shown for the o4-mini model using majority voting over 32 independently sampled solutions at each step.
Figure 5: Heatmap of pairwise comparisons between error distributions on Checkers Jumping ($n=13$) across models, measured using Total Variation (TV) distance. Higher values indicate more dissimilar error distributions. For DeepSeek, only a single estimate is available, so self-comparison is not reported.
...and 6 more figures
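
The Total Variation (TV) distance mentioned in the Figure 5 caption has a standard definition for discrete distributions: half the sum of absolute differences between the two probability vectors. The sketch below shows that definition only. How the paper builds per-step error distributions from raw data is not stated here, so the `normalize` helper and the example counts are hypothetical.

```python
def normalize(counts: list[float]) -> list[float]:
    """Turn non-negative counts into a probability vector (hypothetical helper)."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must have a positive sum")
    return [c / total for c in counts]


def total_variation(p: list[float], q: list[float]) -> float:
    """TV distance between two discrete distributions on the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    # Made-up error counts per step for two models, for illustration only.
    model_a = normalize([8, 1, 1, 0])
    model_b = normalize([1, 1, 0, 8])
    print(total_variation(model_a, model_b))
```

The value ranges from 0 (identical distributions) to 1 (disjoint supports), which matches the caption's reading that higher values indicate more dissimilar error distributions.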
