LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards

Bowen Ping, Zijun Chen, Yiyao Yu, Tingfeng Hui, Junchi Yan, Baobao Chang

TL;DR

This work tackles long-context reasoning under reinforcement learning by addressing reward sparsity with a dense, information-theoretic supervision signal. The proposed LongR framework interleaves reasoning with document reading through a Think-and-Read policy and guides evidence seeking with a contextual reward that is based on relative information gain and computed by a frozen verifier. Empirical results show substantial gains on LongBench v2 and strong generalization to RULER and InfiniteBench across multiple RL algorithms, with ablations supporting the Relative Information Gain design. The approach enables efficient long-context reasoning without rigid chunking, improves robustness to distractors, and applies to diverse long-context tasks. This has practical implications for building more capable, context-aware agents and tools that operate over large document collections or extended dialogues.

Abstract

Reinforcement Learning has emerged as a key driver for LLM reasoning. This capability is equally pivotal in long-context scenarios, such as long-dialogue understanding and structured data analysis, where the challenge extends beyond consuming tokens to performing rigorous deduction. While existing efforts focus on data synthesis or architectural changes, recent work points out that relying solely on sparse, outcome-only rewards yields limited gains, as such coarse signals are often insufficient to guide complex long-context reasoning. To address this, we propose LongR, a unified framework that enhances long-context performance by combining two components: a dynamic "Think-and-Read" mechanism, which interleaves reasoning with document consultation, and a contextual density reward, based on relative information gain, that quantifies the utility of the relevant documents. Empirically, LongR achieves a 9% gain on LongBench v2 and consistent improvements on RULER and InfiniteBench, demonstrating robust efficiency in navigating extensive contexts. Furthermore, LongR consistently enhances performance across diverse RL algorithms (e.g., DAPO, GSPO). Finally, we conduct in-depth analyses to investigate the impact of reasoning chain length on efficiency and the model's robustness against distractors.
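
To make the idea of a reward "based on relative information gain" concrete, here is a minimal sketch under stated assumptions. The abstract does not give the formula, so the definition below (the increase in a verifier's log-probability of the reference answer after reading a document, normalized by the uncertainty before reading) is one plausible reading, not the paper's equation. The function names, the clipping to [0, 1], and the averaging over documents are all hypothetical.

```python
def relative_information_gain(logp_with_doc: float, logp_without_doc: float) -> float:
    """Hypothetical relative information gain of one extracted document.

    logp_with_doc:    verifier log-prob of the reference answer given query + document.
    logp_without_doc: verifier log-prob of the reference answer given the query alone.
    Both are assumed to come from a frozen verifier model (not implemented here).
    """
    baseline_nll = -logp_without_doc  # uncertainty (in nats) before reading
    if baseline_nll <= 0.0:
        return 0.0  # the answer was already certain, so nothing can be gained
    gain = logp_with_doc - logp_without_doc  # absolute information gain
    return gain / baseline_nll  # fraction of the initial uncertainty removed


def dense_reward(doc_logps: list[float], logp_without_doc: float) -> float:
    """Assumed aggregation: mean gain over extracted documents, clipped to [0, 1]."""
    if not doc_logps:
        return 0.0
    gains = [relative_information_gain(lp, logp_without_doc) for lp in doc_logps]
    clipped = [max(0.0, min(1.0, g)) for g in gains]  # distractors contribute 0
    return sum(clipped) / len(clipped)
```

Under this assumed definition, a document that makes the answer certain scores 1, a document that leaves the verifier's belief unchanged scores 0, and a distractor that lowers the answer probability is clipped to 0 rather than penalized. Whether the paper penalizes distractors is not stated in this section.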

Paper Structure

This paper contains 30 sections, 10 equations, 11 figures, and 7 tables.

Figures (11)

  • Figure 1: Qualitative comparison between Outcome-Driven CoT (Standard RL, left) and Think-and-Read CoT (LongR, right). Left: Without dense supervision, the baseline model tends to rely on superficial query matching while overlooking detailed context reading. Right: LongR incorporates a contextual dense reward. By incentivizing the model to actively consult the context during the reasoning process, it significantly enhances long-context understanding capabilities.
  • Figure 2: Overview of the LongR Framework. The system operates within a standard reinforcement learning pipeline enhanced by three key mechanisms: (1) Curriculum Learning: through a progressive curriculum, models learn to dynamically interleave consultation with reasoning purely via RL, eliminating the need for costly, long-context-specific SFT data engineering. (2) Interleaved Think-and-Read Policy: generating actions as a dynamic stream where reasoning steps (blue) alternate with grounded evidence extraction (orange), enabling flexible information seeking without rigid chunking. (3) Dense Utility Supervision: providing additional fine-grained feedback by quantifying the utility of each extracted document via relative information gain.
  • Figure 3: Reasoning chain length comparison for 8B models.
  • Figure 4: Reasoning chain length comparison for 4B models.
  • Figure 5: The system prompt used for outcome-only RL.
  • ...and 6 more figures
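
The interleaved Think-and-Read stream described in the Figure 2 caption can be illustrated with a small parsing sketch. The `<think>` and `<read>` tags, and both function names, are hypothetical: this section does not show the markup the model actually emits, so the code only illustrates how reasoning segments and extracted evidence could be separated before each extracted document is scored.

```python
import re

# Hypothetical markup: <think>...</think> for reasoning steps,
# <read>...</read> for evidence quoted from the context.
SEGMENT = re.compile(r"<(think|read)>(.*?)</\1>", re.DOTALL)


def split_rollout(text: str) -> list[tuple[str, str]]:
    """Return (kind, body) pairs in the order they appear in the rollout."""
    return [(m.group(1), m.group(2).strip()) for m in SEGMENT.finditer(text)]


def extracted_evidence(text: str) -> list[str]:
    """Keep only the evidence segments, i.e. the spans a dense reward would score."""
    return [body for kind, body in split_rollout(text) if kind == "read"]
```

Because segments are recovered in order and can be of any length, the sketch mirrors the "no rigid chunking" property: the policy decides where each evidence span starts and ends.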