R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, Xunliang Cai

TL;DR

R-Horizon identifies a gap in how Large Reasoning Models (LRMs) are evaluated and trained for long-horizon reasoning, and proposes a practical method, query composition, for generating interdependent multi-step tasks. It establishes a six-task benchmark across mathematics, code, and agentic domains and shows that the performance of current LRMs degrades substantially as the reasoning horizon lengthens. Reconstructing training data into composed, dependent problems and applying reinforcement learning with verified rewards (RLVR) using GRPO improves performance on multi-horizon tasks, transfers to single-horizon tasks, and leads to better token-budget management. Overall, R-Horizon offers a scalable, low-cost framework for both evaluating and enhancing long-horizon reasoning in LRMs.

Abstract

Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on the multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks, with an increase of 7.5 on AIME2024. These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.
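
The abstract describes query composition only at a high level, so the following is a minimal sketch of one way interdependent problems could be chained, not the authors' pipeline. It assumes each problem has an integer answer and a single numeric slot that can be rewritten in terms of the previous answer; the names `Problem`, `compose`, and `all_correct` are hypothetical and introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Problem:
    # Problem text with a "{x}" slot for the one value that can be made dependent.
    template: str
    # Value of that slot in the original, standalone problem.
    seed_value: int
    # Verified integer answer of the standalone problem.
    answer: int


def compose(problems: List[Problem]) -> str:
    """Chain problems so that problem i+1 needs the answer to problem i."""
    parts = []
    for i, p in enumerate(problems, start=1):
        if i == 1:
            slot = str(p.seed_value)
        else:
            prev = problems[i - 2]
            # The offset is chosen so the slot equals its original seed value
            # whenever the previous answer is correct, so gold answers stay valid.
            offset = p.seed_value - prev.answer
            sign = "+" if offset >= 0 else "-"
            slot = f"(A{i - 1} {sign} {abs(offset)})"
        parts.append(f"Problem {i}: {p.template.format(x=slot)} Call the answer A{i}.")
    return "\n".join(parts)


def all_correct(predicted: List[int], problems: List[Problem]) -> bool:
    """Hypothetical all-or-nothing check: every sub-answer must match."""
    return len(predicted) == len(problems) and all(
        a == p.answer for a, p in zip(predicted, problems)
    )


if __name__ == "__main__":
    chain = [
        Problem("Compute {x} + 5.", seed_value=7, answer=12),
        Problem("Compute 3 * {x}.", seed_value=4, answer=12),
    ]
    print(compose(chain))
```

In this sketch a wrong answer to the first problem propagates into the second, which is the kind of dependency the benchmark is described as testing. The all-or-nothing check is one possible verifier; the source does not say which reward scheme is used.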

Paper Structure

This paper contains 55 sections, 7 equations, 22 figures, 4 tables, 1 algorithm.

Figures (22)

  • Figure 1: Actual versus theoretical accuracy of R1-series models on R-Horizon datasets.
  • Figure 2: The R-Horizon data composition pipeline is illustrated in (a)-(c). We leverage R-Horizon to construct a comprehensive long-horizon reasoning evaluation benchmark spanning 6 tasks and generate multi-horizon training data for long-horizon reinforcement learning.
  • Figure 3: Evaluation results of R-Horizon Benchmark.
  • Figure 4: Training curves comparing single and composed data on $\text{AIME24}_\text{avg@8}$ and reward.
  • Figure 5: Error type distribution across different query numbers. There are four error categories: Problem Reasoning Error means the model made a reasoning error within a specific problem; Dependency Reasoning Error means the model solved the previous problems correctly but made an error when computing the dependencies between them; Early Stop means the model terminated generation prematurely after solving only the earlier problems; Output Truncation means generation exceeded the token limit.
  • ...and 17 more figures
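
Figure 1 contrasts actual with theoretical accuracy, but the caption above does not define the theoretical value. One plausible reading, assumed here for illustration only, is that if a model solves each standalone problem independently with accuracy p_i, the chance of solving all n composed problems is the product of the p_i. The function name `expected_all_correct` is hypothetical.

```python
import math
from typing import Sequence


def expected_all_correct(per_problem_acc: Sequence[float]) -> float:
    """Probability of solving every problem, assuming independent successes."""
    if any(not 0.0 <= p <= 1.0 for p in per_problem_acc):
        raise ValueError("accuracies must lie in [0, 1]")
    # Under independence, the joint success probability is the product.
    return math.prod(per_problem_acc)


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    print(expected_all_correct([0.9, 0.8]))
```

Under this assumption, an actual composed accuracy below the product would indicate degradation beyond what the compounding of independent errors alone explains.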