h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning

Sumeet Ramesh Motwani, Alesia Ivanova, Ziyang Cai, Philip Torr, Riashat Islam, Shital Shah, Christian Schroeder de Witt, Charles London

TL;DR

The paper tackles long-horizon reasoning in large language models by bootstrapping from abundant short-horizon data. It introduces a compositional data method that chains simple problems into longer, dependent tasks and trains via curriculum RL with outcome-only rewards, avoiding step-level supervision. Theoretical analysis shows an exponential improvement in sample complexity for curriculum training over direct full-horizon training, while empirical results demonstrate transfer to harder math benchmarks, long-context tasks, and out-of-distribution reasoning domains. The approach offers a scalable, data-efficient path to expanding LLMs' long-horizon capabilities without additional annotations.
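
As a rough intuition for the sample-complexity claim, consider a toy model (an assumption made here for illustration, not the paper's theorem): each of the $H$ dependent steps is solved independently with probability $p \in (0,1)$. An outcome-only reward on the full chain is then nonzero with probability $p^H$, so the expected number of rollouts before the first rewarded one is

$$\mathbb{E}[N_{\text{full}}] = \frac{1}{p^{H}},$$

which grows exponentially in $H$. If a curriculum instead reaches stage $h$ only once chains of length $h-1$ are solved with probability close to 1, the reward at stage $h$ fires with probability roughly $p$, and the total over stages is heuristically

$$\mathbb{E}[N_{\text{curr}}] \approx \sum_{h=1}^{H} \frac{1}{p} = \frac{H}{p},$$

which is linear rather than exponential in $H$. For example, with $p = 0.5$ and $H = 10$ this is about $1024$ rollouts versus about $20$.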

Abstract

Large language models excel at short-horizon reasoning tasks, but performance drops as reasoning horizon lengths increase. Existing approaches to combat this rely on inference-time scaffolding or costly step-level supervision, neither of which scales easily. In this work, we introduce a scalable method to bootstrap long-horizon reasoning capabilities using only existing, abundant short-horizon data. Our approach synthetically composes simple problems into complex, multi-step dependency chains of arbitrary length. We train models on this data using outcome-only rewards under a curriculum that automatically increases in complexity, allowing RL training to be scaled much further without saturating. Empirically, our method generalizes remarkably well: curriculum training on composed 6th-grade level math problems (GSM8K) boosts accuracy on longer, competition-level benchmarks (GSM-Symbolic, MATH-500, AIME) by up to 2.06x. It also transfers significantly to diverse out-of-distribution ReasoningGym domains and long-context benchmarks, indicating broader generalization. Importantly, our long-horizon improvements are significantly higher than baselines even at high pass@k, showing that models can learn new reasoning paths under RL. Theoretically, we show that curriculum RL with outcome rewards achieves an exponential improvement in sample complexity over full-horizon training, providing training signal comparable to dense supervision. h1 therefore introduces an efficient path towards scaling RL for long-horizon problems using only existing data.
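
The composition-and-curriculum idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Step` container, the templates in `POOL`, and the functions `compose_chain`, `outcome_reward`, and `curriculum` are all hypothetical names, and the paper may compose problems and schedule horizons differently.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Tuple


@dataclass
class Step:
    # Hypothetical container: a question template with an {x} slot,
    # plus a function that computes the gold answer from the input value.
    template: str
    solve: Callable[[int], int]


# Toy short-horizon problems (illustrative, not taken from GSM8K).
POOL: List[Step] = [
    Step("Let x = {x}. A box holds x apples. How many apples are in 4 boxes?",
         lambda x: 4 * x),
    Step("Let x = {x}. Tom has x marbles and finds 5 more. How many now?",
         lambda x: x + 5),
    Step("Let x = {x}. A rope is x meters long and 7 meters are cut off. "
         "How long is it now?",
         lambda x: x - 7),
]


def compose_chain(steps: List[Step], seed_value: int) -> Tuple[str, int]:
    """Chain steps so that each one consumes the previous step's answer."""
    lines = []
    value = seed_value
    for i, step in enumerate(steps, start=1):
        # Only the first problem sees a literal number; later problems
        # refer back to the previous answer, creating the dependency.
        source = str(value) if i == 1 else f"the answer to problem {i - 1}"
        lines.append(f"Problem {i}: " + step.template.format(x=source))
        value = step.solve(value)
    lines.append(f"Report only the answer to problem {len(steps)}.")
    return "\n".join(lines), value


def outcome_reward(model_answer: str, gold: int) -> float:
    """Outcome-only reward: 1.0 for a correct final answer, else 0.0."""
    try:
        return 1.0 if int(model_answer.strip()) == gold else 0.0
    except ValueError:
        return 0.0  # unparseable answers receive no reward


def curriculum(pool: List[Step], max_horizon: int,
               seed_value: int = 3) -> Iterator[Tuple[int, str, int]]:
    """Yield (horizon, prompt, gold) with the horizon growing stage by stage."""
    for h in range(1, max_horizon + 1):
        prompt, gold = compose_chain(pool[:h], seed_value)
        yield h, prompt, gold


if __name__ == "__main__":
    for horizon, prompt, gold in curriculum(POOL, max_horizon=3):
        print(f"--- horizon {horizon} (gold answer {gold}) ---")
        print(prompt)
```

Because only the final answer is checked, an error at any intermediate step zeroes the reward, so no step-level labels are needed to build or score the composed problems.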

Paper Structure

This paper contains 52 sections, 75 equations, 8 figures, 9 tables, 1 algorithm.

Figures (8)

  • Figure 1: Our method $h1$ involves composing existing short-horizon reasoning problems into longer sequences. We then apply a stage-wise curriculum on this composed data to scale online RL training. We observe significantly improved long-horizon reasoning capabilities and OOD improvements.
  • Figure 2: Curriculum RL training on compositional data offers significant in-domain long-horizon reasoning gains (up to $2.9\times$). This prevents RL training from saturating and uses no new data.
  • Figure 3: Our curriculum-based RL training using composed synthetic data outperforms RLVR on standard data from the same set even at pass@128, teaching new capabilities that did not previously exist in the base model. Long-horizon reasoning (LHR) requires going beyond improving single-step performance.
  • Figure 4: Long-horizon training on GSM8K generalizes to significantly harder tasks. Performance on AIME 2024 improves by $2.1\times$ and ultra-long-context capabilities improve by $1.2\times$.
  • Figure 5: Left: Sample count distributions for four settings. Middle: Accuracy at each stage across sample count settings. Under a mild skew towards shorter samples, as in Settings 1 and 2, the model can perform as well as the uniform-sample baseline. Right: Training compute across settings. The settings skewed towards shorter samples have more training cost, measured in training tokens seen. Overall, low-cost data distributions can achieve near-optimal performance.
  • ...and 3 more figures
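
The pass@128 comparison in Figure 3 relies on the pass@k metric. The page does not state how the authors compute it, so the sketch below assumes the widely used unbiased estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$, for $n$ sampled answers of which $c$ are correct. The function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Returns the probability that at least one of k samples drawn without
    replacement from the n generated samples is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # One correct answer out of four samples.
    print(pass_at_k(n=4, c=1, k=1))  # 0.25
    print(pass_at_k(n=4, c=1, k=2))  # 0.5
```

A gain that persists at large k indicates problems the baseline fails on in all k attempts, which is the sense in which the caption speaks of capabilities absent from the base model.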