LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards

Guanzheng Chen, Michael Qizhe Shieh, Lidong Bing

TL;DR

LongRLVR augments the sparse answer reward with a dense, verifiable context reward, showing that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, as its reliance on internal parametric knowledge is ill-suited for tasks requiring contextual grounding, that is, the ability to find and reason over externally provided information. We identify a key reason for this failure: a reward based solely on the final answer is too sparse to effectively guide the model in identifying relevant evidence. We formally prove that the outcome-only reward leads to significantly vanishing gradients for the context grounding process, rendering learning intractable. To overcome this bottleneck, we introduce LongRLVR, which augments the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model to select the correct grounding information, providing a robust learning gradient that solves the underlying optimization challenge. We validate our method on challenging long-context benchmarks using Qwen and LLaMA models. LongRLVR consistently and significantly outperforms standard RLVR across all models and benchmarks, e.g., boosting a 14B model's scores on RULER-QA from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5. Our work demonstrates that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications. Our code is available at https://github.com/real-absolute-AI/LongRLVR.

Paper Structure

This paper contains 39 sections, 2 theorems, 31 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Proposition 1

Under Assumption ass:sparse and the grounding parameterization in eq:param_dis, the gradient of the expected answer reward with respect to the logit $s_j$ for any essential chunk $c_j \in G$ is: [...] Let $\Delta_j(T) \triangleq f(T \cup \{c_j\})-f(T)$ denote the marginal gain of chunk $c_j$ for any $T\subseteq G\setminus\{c_j\}$, and assume $\Delta_j(T)\le \bar{\delta}_j$ for some constant $\bar{\delta}_j$ [...]

Figures (6)

  • Figure 1: The accuracy reward and contextual recall of naive RLVR and LongRLVR on the training data.
  • Figure 2: Data format for LongRLVR training. The model is tasked to retrieve useful chunks from the long context before generating the final answer. These chunk identifiers are utilized to derive verifiable context rewards.
  • Figure 3: Study on reward components. The answer-only model suffers from stagnating contextual recall, which caps its final performance. The context-only model excels at recall but fails to achieve accurate rewards. By synergizing both signals, Qwen2.5-7B-1M-LongRLVR achieves the best and most stable performance on the LongBench v2 benchmark, proving that both rewards are essential.
  • Figure 4: Data quality ablation on LongBench v2. Left: The effect of rejection sampling quality. Right: The effect of different data filtering strategies. High-quality, challenging data is shown to be most effective. Results are reported on Qwen2.5-7B-1M-LongRLVR.
  • Figure 5: Ablation studies on key hyperparameters for LongRLVR. We analyze the overall performance on LongBench v2 while varying (a) the blending factor $\eta$ in the context reward, (b) the F-score parameter $\beta$, and (c) the number of chunks per document. Results are reported for both Qwen2.5-7B and LLaMA-3.1-8B.
  • ...and 1 more figure
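
The captions above mention chunk identifiers, an F-score parameter $\beta$, and a blending factor $\eta$, but this page does not give the reward formula itself. As a rough illustration only, the sketch below scores predicted chunk identifiers against gold ones with a standard F-beta score and adds that score to a binary answer reward. The function names, the additive combination, and the `context_weight` parameter are all assumptions made for this example; the paper's $\eta$ may enter the reward differently.

```python
def f_beta(predicted_ids, gold_ids, beta=1.0):
    """Standard F-beta score between two sets of chunk identifiers."""
    predicted, gold = set(predicted_ids), set(gold_ids)
    hits = len(predicted & gold)
    if hits == 0:
        # Covers empty predictions, empty gold sets, and zero overlap.
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def combined_reward(answer_correct, predicted_ids, gold_ids,
                    context_weight=0.5, beta=1.0):
    """Sparse answer reward plus a dense context reward.

    ASSUMPTION: the additive form and `context_weight` are hypothetical,
    chosen for illustration; they are not taken from the paper.
    """
    answer_reward = 1.0 if answer_correct else 0.0
    context_reward = f_beta(predicted_ids, gold_ids, beta)
    return answer_reward + context_weight * context_reward


if __name__ == "__main__":
    # A wrong answer with partially correct retrieval still earns a signal.
    print(combined_reward(False, ["c2", "c7"], ["c2", "c5"]))
```

Under these assumptions, a rollout that retrieves some of the right chunks but answers incorrectly still receives a nonzero reward, which is the dense signal the abstract describes. Values of `beta` above 1 weight recall more heavily than precision.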

Theorems & Definitions (5)

  • Proposition 1: Vanishing Gradients for Grounding
  • Proposition 2: Non-Vanishing Grounding Signal
  • Proof: Proof using REINFORCE
  • Proof: Proof using GRPO
  • Proof: Proof using REINFORCE
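
The contrast between Propositions 1 and 2 can be illustrated with a toy model. The sketch below is not the paper's derivation: it assumes each essential chunk is selected independently with probability $\sigma(s_j)$, that the answer reward is 1 only when every essential chunk is selected, and that the context reward is plain recall over the essential chunks. Under those assumptions the expected answer reward is $\prod_k \sigma(s_k)$ and the expected recall is $\frac{1}{|G|}\sum_k \sigma(s_k)$, so both gradients with respect to a logit have closed forms.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def answer_reward_grad(logits, j):
    """d/ds_j of prod_k sigmoid(s_k).

    ASSUMPTION: all-or-nothing answer reward with independent selection.
    Uses d sigmoid(s)/ds = p * (1 - p), so the result is (1 - p_j) * prod_k p_k.
    """
    probs = [sigmoid(s) for s in logits]
    return (1.0 - probs[j]) * math.prod(probs)


def recall_reward_grad(logits, j):
    """d/ds_j of mean_k sigmoid(s_k), i.e. p_j * (1 - p_j) / |G|."""
    p = sigmoid(logits[j])
    return p * (1.0 - p) / len(logits)


if __name__ == "__main__":
    # With all logits at 0, every chunk is selected with probability 0.5.
    for size in (2, 4, 8, 16):
        logits = [0.0] * size
        print(size, answer_reward_grad(logits, 0), recall_reward_grad(logits, 0))
```

In this toy setting the answer-only gradient shrinks geometrically with the number of essential chunks, while the recall gradient shrinks only as $1/|G|$. This matches the qualitative claim of the two propositions, though the paper's own parameterization and bounds may differ.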