$V_{0.5}$: Generalist Value Model as a Prior for Sparse RL Rollouts

Yi-Kai Zhang; Yueqing Sun; Hongyan Hao; Qi Gu; Xunliang Cai; De-Chuan Zhan; Han-Jia Ye

TL;DR

This paper proposes a real-time statistical testing and dynamic budget allocation system that adaptively fuses the baseline predicted by a generalist value model (acting as a prior) with the empirical mean derived from sparse rollouts, constructing a robust baseline that balances computational efficiency with extremely low variance.
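The fusion of a prior with a sparse empirical mean can be illustrated with a textbook shrinkage estimator. The sketch below is only an illustration under stated assumptions, not the paper's estimator: it treats the prior as a fixed prediction with squared bias `prior_sq_bias`, treats the empirical mean as unbiased with variance `reward_var / n`, and picks the weight `w` that minimizes `w**2 * prior_sq_bias + (1 - w)**2 * reward_var / n`. All function and parameter names are hypothetical.

```python
from statistics import mean


def shrinkage_weight(var_of_mean: float, prior_sq_bias: float) -> float:
    """Weight on the prior that minimizes the MSE of the fused baseline.

    Assumes (for illustration) a deterministic prior with squared bias
    `prior_sq_bias` and an unbiased empirical mean with variance `var_of_mean`.
    """
    denom = var_of_mean + prior_sq_bias
    if denom == 0.0:
        # Both sources look exact; fall back to trusting the prior.
        return 1.0
    return var_of_mean / denom


def fused_baseline(prior_value: float, rewards: list[float],
                   reward_var: float, prior_sq_bias: float) -> float:
    """Return w * prior + (1 - w) * empirical mean of the rollout rewards."""
    var_of_mean = reward_var / len(rewards)
    w = shrinkage_weight(var_of_mean, prior_sq_bias)
    return w * prior_value + (1.0 - w) * mean(rewards)


if __name__ == "__main__":
    # Hypothetical numbers: 4 rollouts, 2 of which are correct.
    rewards = [1.0, 0.0, 0.0, 1.0]
    b = fused_baseline(prior_value=0.6, rewards=rewards,
                       reward_var=0.25, prior_sq_bias=0.0625)
    advantages = [r - b for r in rewards]
    print(b, advantages)
```

With these hypothetical inputs the variance of the mean equals the prior's squared bias, so the weight is 0.5 and the baseline lands halfway between the prior (0.6) and the empirical mean (0.5). More rollouts shrink the variance term and shift weight toward the empirical mean; a more accurate prior shifts weight toward the prior.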

Abstract

In Reinforcement Learning with Verifiable Rewards (RLVR), constructing a robust advantage baseline is critical for policy gradients, because it guides the policy model to reinforce desired behaviors. Recent research has introduced Generalist Value Models (such as $V_0$), which achieve pre-trained value estimation by explicitly encoding model capabilities in-context, eliminating the need to update the value model synchronously alongside the policy model. In this paper, we propose $V_{0.5}$, which adaptively fuses the baseline predicted by such a value model (acting as a prior) with the empirical mean derived from sparse rollouts. This constructs a robust baseline that balances computational efficiency with extremely low variance. Specifically, we introduce real-time statistical testing and dynamic budget allocation, which balance the high variance caused by sparse sampling against the systematic bias (or hallucinations) inherent in the value model's prior. By constructing a hypothesis test to evaluate the prior's reliability in real time, the system dynamically allocates additional rollout budget on demand. This mechanism minimizes the baseline estimator's Mean Squared Error (MSE), guaranteeing stable policy gradients even under extreme sparsity with a group size of 4. Extensive evaluations across six mathematical reasoning benchmarks demonstrate that $V_{0.5}$ significantly outperforms GRPO and DAPO, achieving faster convergence and a performance improvement of over 10%.
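The test-then-allocate loop described in the abstract can be sketched as follows. This is a minimal illustration that substitutes a plain two-sided z-test for whatever test and stopping rule the paper actually uses (the outline mentions sequential OSLA allocation, which is not reproduced here). The names `prior_is_consistent`, `allocate_rollouts`, `sample_reward`, and the defaults `n_init=4`, `n_step=4`, `n_max=16`, `z_crit=1.96` are all assumptions chosen for the example, and rewards are assumed to be binary.

```python
import math
from typing import Callable


def prior_is_consistent(prior: float, rewards: list[float],
                        z_crit: float = 1.96) -> bool:
    """Two-sided z-test of H0: the true success rate equals the prior."""
    n = len(rewards)
    # Bernoulli variance under H0, floored to avoid division by zero.
    null_var = max(prior * (1.0 - prior), 1e-6)
    z = (sum(rewards) / n - prior) / math.sqrt(null_var / n)
    return abs(z) <= z_crit


def allocate_rollouts(prior: float, sample_reward: Callable[[], float],
                      n_init: int = 4, n_step: int = 4,
                      n_max: int = 16) -> list[float]:
    """Start sparse; buy more rollouts only while the prior looks unreliable."""
    rewards = [sample_reward() for _ in range(n_init)]
    while len(rewards) < n_max and not prior_is_consistent(prior, rewards):
        step = min(n_step, n_max - len(rewards))
        rewards.extend(sample_reward() for _ in range(step))
    return rewards


if __name__ == "__main__":
    always_correct = lambda: 1.0
    # A prior close to the observed rewards keeps the group at 4 rollouts.
    print(len(allocate_rollouts(0.9, always_correct)))
    # A prior far from the observed rewards triggers extra budget.
    print(len(allocate_rollouts(0.1, always_correct)))
```

Repeating a fixed-threshold test after each batch inflates the false-rejection rate, so a real sequential procedure would need a corrected threshold or a different stopping criterion; the sketch only shows the control flow.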

Paper Structure (40 sections, 6 theorems, 48 equations, 5 figures)

Introduction
Preliminaries
Policy Gradients and the Baseline
Generalist Value Models: Breaking the Coupling Dilemma
The Bias-Variance Tradeoff in Sparse Rollouts
Unified Advantage Formulations
Method
Core Execution Logic of $V_{0.5}$
Motivation: Propagation Limits of Baseline MSE on Gradient Variance
Empirical Shrinkage Fusion
Empirical Weight Estimation
Bias Bounds of the Estimator
Sequential OSLA Allocation and Optimal Stopping
Experiments
Experimental Setup and Implementation Details
...and 25 more sections

Key Result

Theorem 3.1

For a single-step policy gradient estimator $\hat{g}(\theta)$ using baseline $b$, the trace of its covariance matrix is strictly bounded by: where:

Figures (5)

Figure 1: Performance of $V_{0.5}$ across six diverse mathematical reasoning benchmarks. $V_{0.5}$ outperforms GRPO [DeepSeek_Math] and DAPO [DAPO], achieving faster convergence and a performance improvement of over 10%.
Figure 2: Demonstration of PPO, GRPO, and our proposed $V_{0.5}$ framework. While PPO requires a synchronously trained value model and GRPO relies on the empirical group mean, $V_{0.5}$ computes an adaptive baseline by fusing a prior from a frozen generalist value model ($V_0$) with sparse empirical rollouts via a dynamic weight $w$ (detailed in \ref{thm:opt_weight}, \ref{eq:sigma}, and \ref{eq:delta}).
Figure 3: Evolution of policy gradient norm. $V_{0.5}$ maintains a lower and more stable gradient norm than GRPO. By trading a strictly bounded bias for a reduced baseline MSE, it effectively neutralizes the variance amplification inherent in sparse rollouts.
Figure 4: Evolution of policy entropy. While GRPO's high-variance gradients cause rapid entropy decay, $V_{0.5}$ leverages low-noise baseline estimation to sustain higher entropy, ensuring robust exploration in reasoning tasks.
Figure 5: Performance of $V_{0.5}$ under extreme sparsity ($1, 2, 4,$ and $8$ rollouts) vs. standard GRPO ($16$ rollouts). To ensure a fair comparison, prompt batch sizes are adjusted to maintain constant per-step computational overhead.
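The compute-matching rule in the Figure 5 caption amounts to holding the product of prompts and rollouts per step constant. The arithmetic is sketched below under loud assumptions: the reference prompt batch size of 32 is hypothetical (the source does not state it), and the sketch ignores both the value model's inference cost and any extra rollouts allocated on demand.

```python
def matched_prompt_batch(ref_prompts: int, ref_rollouts: int,
                         rollouts: int) -> int:
    """Prompt batch size that keeps prompts * rollouts equal to the reference."""
    total = ref_prompts * ref_rollouts
    if total % rollouts != 0:
        raise ValueError("rollouts must divide the reference sample budget")
    return total // rollouts


if __name__ == "__main__":
    # Hypothetical reference: 32 prompts with 16 rollouts each (512 samples).
    for k in (1, 2, 4, 8):
        print(k, matched_prompt_batch(32, 16, k))
```

Under this assumed reference, a group size of 4 would use four times as many prompts per step as the 16-rollout setting, so sparser groups trade per-prompt sample depth for prompt diversity at equal sampling cost.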

Theorems & Definitions (14)

Theorem 3.1
Theorem 3.2
Theorem 3.3
Theorem 3.4
Lemma 3.5
Theorem 3.6
Proof of \ref{thm:mse_bound}
Proof of \ref{thm:mse_decomp}
Proof of \ref{thm:opt_weight}
Proof of \ref{thm:bias_bound}
...and 4 more
