LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards

Guanzheng Chen; Michael Qizhe Shieh; Lidong Bing

TL;DR

LongRLVR augments the sparse answer reward with a dense and verifiable context reward, demonstrating that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, because its reliance on internal parametric knowledge is ill-suited to tasks that require contextual grounding: the ability to find and reason over externally provided information. We identify a key reason for this failure: a reward based solely on the final answer is too sparse to guide the model toward identifying relevant evidence. We formally prove that the outcome-only reward leads to vanishing gradients for the context grounding process, rendering learning intractable. To overcome this bottleneck, we introduce LongRLVR, which augments the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model to select the correct grounding information, providing a robust learning gradient that solves the underlying optimization challenge. We validate our method on challenging long-context benchmarks using Qwen and LLaMA models. LongRLVR consistently and significantly outperforms standard RLVR across all models and benchmarks, e.g., boosting a 14B model's score on RULER-QA from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5. Our work demonstrates that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications. Our code is available at https://github.com/real-absolute-AI/LongRLVR.

Paper Structure

This paper contains 39 sections, 2 theorems, 31 equations, 6 figures, 1 table, and 1 algorithm.

Introduction
Method
RLVR on Long Contexts: An Explicit Grounding Formulation
The Vanishing Grounding Gradient with Outcome-Only Rewards
LongRLVR: Learning with a Verifiable Context Reward
Theoretical Foundation
A Practical Instantiation: The Modulated F-Score Reward
Synthetic Data Generation for Grounded QA
Experimental Setup
Implementation Details
Data Curation.
Training Details.
Evaluation Protocol
Baselines.
Benchmarks.
...and 24 more sections

Key Result

Proposition 1

Under Assumption ass:sparse and the grounding parameterization in eq:param_dis, the gradient of the expected answer reward with respect to the logit $s_j$ for any essential chunk $c_j \in G$ is: Let $\Delta_j(T) \triangleq f(T \cup \{c_j\})-f(T)$ denote the marginal gain of chunk $c_j$ for any $T\subseteq G\setminus\{c_j\}$, and assume $\Delta_j(T)\le \bar{\delta}_j$ for some constant $\bar{\delt

Figures (6)

Figure 1: The accuracy reward and contextual recall of naive RLVR and LongRLVR on the training data.
Figure 2: Data format for LongRLVR training. The model is tasked to retrieve useful chunks from the long context before generating the final answer. These chunk identifiers are utilized to derive verifiable context rewards.
Figure 3: Study on reward components. The answer-only model suffers from stagnating contextual recall, which caps its final performance. The context-only model excels at recall but fails to achieve accurate rewards. By synergizing both signals, Qwen2.5-7B-1M-LongRLVR achieves the best and most stable performance on the LongBench v2 benchmark, proving that both rewards are essential.
Figure 4: Data quality ablation on LongBench v2. Left: The effect of rejection sampling quality. Right: The effect of different data filtering strategies. High-quality, challenging data is shown to be most effective. Results are reported on Qwen2.5-7B-1M-LongRLVR.
Figure 5: Ablation studies on key hyperparameters for LongRLVR. We analyze the overall performance on LongBench v2 while varying (a) the blending factor $\eta$ in the context reward, (b) the F-score parameter $\beta$, and (c) the number of chunks per document. Results are reported for both Qwen2.5-7B and LLaMA-3.1-8B.
...and 1 more figure
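
The captions above refer to chunk identifiers, an F-score parameter $\beta$, and a context reward combined with the answer reward. As a rough illustration only, the sketch below computes a plain F-beta score between the chunk identifiers a model selects and a gold set, then adds it to a binary answer reward. It is not the paper's modulated F-score: the function names, the additive combination, and the `weight` argument are all assumptions made for this example (in particular, `weight` is not claimed to be the blending factor $\eta$).

```python
def f_beta(selected, gold, beta=1.0):
    """Plain F-beta score between selected and gold chunk identifier sets."""
    selected, gold = set(selected), set(gold)
    hits = len(selected & gold)
    if hits == 0:
        # Covers empty inputs and selections with no overlap.
        return 0.0
    precision = hits / len(selected)
    recall = hits / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def combined_reward(answer_correct, selected, gold, beta=1.0, weight=0.5):
    """Hypothetical additive mix: binary answer reward plus weighted context reward."""
    answer_reward = 1.0 if answer_correct else 0.0
    return answer_reward + weight * f_beta(selected, gold, beta)


# A wrong answer with partially correct grounding still earns a nonzero reward.
partial = combined_reward(False, ["c1", "c2"], ["c1", "c3"])
```

With `beta` greater than 1 the score weights recall more heavily than precision, so a rollout that misses an essential chunk is penalized more than one that includes an extra chunk.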

Theorems & Definitions (5)

Proposition 1: Vanishing Gradients for Grounding
Proposition 2: Non-Vanishing Grounding Signal
proof : Proof using REINFORCE
proof : Proof using GRPO
proof : Proof using REINFORCE
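
For intuition about the vanishing gradients named in Proposition 1, the following toy model may help. It is an assumption made purely for illustration and is not the paper's grounding parameterization: each essential chunk is selected independently with probability `sigmoid(s_j)`, and the answer reward is 1 only when every essential chunk is selected. Under that assumption the expected reward is the product of the selection probabilities, and its gradient with respect to one logit shrinks geometrically as the number of essential chunks grows.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def expected_answer_reward(logits):
    """P(all essential chunks selected) under independent Bernoulli selection."""
    prob = 1.0
    for s in logits:
        prob *= sigmoid(s)
    return prob


def grad_wrt_logit(logits, j):
    """Analytic derivative of the expected reward with respect to logits[j].

    Equals p_j * (1 - p_j) * product of p_i over all i != j.
    """
    p_j = sigmoid(logits[j])
    others = 1.0
    for i, s in enumerate(logits):
        if i != j:
            others *= sigmoid(s)
    return p_j * (1.0 - p_j) * others


# With every logit at 0 (selection probability 0.5), the gradient for one
# chunk halves each time another essential chunk is added.
grads = {k: grad_wrt_logit([0.0] * k, 0) for k in (1, 4, 8)}
```

In this toy setting a dense per-chunk signal would not carry the product over the other chunks, which is the qualitative contrast that Propositions 1 and 2 are named for.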
