
$V_{0.5}$: Generalist Value Model as a Prior for Sparse RL Rollouts

Yi-Kai Zhang, Yueqing Sun, Hongyan Hao, Qi Gu, Xunliang Cai, De-Chuan Zhan, Han-Jia Ye

TL;DR

This paper proposes a real-time statistical testing and dynamic budget allocation system that adaptively fuses the baseline predicted by a generalist value model (acting as a prior) with the empirical mean derived from sparse rollouts, constructing a robust baseline that balances computational efficiency with extremely low variance.

Abstract

In Reinforcement Learning with Verifiable Rewards (RLVR), constructing a robust advantage baseline is critical for policy gradients, as it guides the policy model to reinforce desired behaviors. Recent research has introduced Generalist Value Models (such as $V_0$), which achieve pre-trained value estimation by explicitly encoding model capabilities in-context, eliminating the need to synchronously update the value model alongside the policy model. In this paper, we propose $V_{0.5}$, which adaptively fuses the baseline predicted by such a value model (acting as a prior) with the empirical mean derived from sparse rollouts. This constructs a robust baseline that balances computational efficiency with extremely low variance. Specifically, we introduce real-time statistical testing and dynamic budget allocation, which balance the high variance caused by sparse sampling against the systematic bias (or hallucinations) inherent in the value model's prior. By constructing a hypothesis test to evaluate the prior's reliability in real time, the system dynamically allocates additional rollout budget on demand. This mechanism minimizes the baseline estimator's Mean Squared Error (MSE), guaranteeing stable policy gradients even under extreme sparsity with a group size of 4. Extensive evaluations across six mathematical reasoning benchmarks demonstrate that $V_{0.5}$ significantly outperforms GRPO and DAPO, achieving faster convergence and performance improvements of over 10% in some cases.
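The fusion described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the weight formula, the test statistic, or the allocation rule, so everything below is an assumption. The sketch treats rewards as binary, uses the textbook MSE-optimal shrinkage weight for combining a fixed (possibly biased) prior with an unbiased sample mean, and uses a simple z-test to decide when to request more rollouts. The names `fused_baseline`, `adaptive_baseline`, `z_crit`, `n_init`, and `n_max` are hypothetical.

```python
import math


def fused_baseline(prior, rewards, z_crit=1.96):
    """Fuse a value-model prior with the empirical mean of sparse rollouts.

    Assumed model (not from the paper): rewards are 0/1, the prior has
    zero variance and unknown bias, and the sample mean is unbiased.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    # Bernoulli variance implied by the prior, floored to avoid division by zero.
    var = max(prior * (1.0 - prior), 1e-6)
    se2 = var / n  # variance of the sample mean if the prior were correct
    # z-test: is the observed mean inconsistent with the prior?
    z = abs(mean - prior) / math.sqrt(se2)
    # Rough squared-bias estimate: squared gap minus sampling noise, floored at 0.
    delta2 = max((mean - prior) ** 2 - se2, 0.0)
    # Weight on the prior that minimizes w^2 * delta2 + (1 - w)^2 * se2.
    w = se2 / (se2 + delta2)
    baseline = w * prior + (1.0 - w) * mean
    return baseline, w, z > z_crit


def adaptive_baseline(prior, sample_reward, n_init=4, n_max=16, z_crit=1.96):
    """Start with n_init rollouts; add more while the test rejects the prior."""
    rewards = [sample_reward() for _ in range(n_init)]
    baseline, w, reject = fused_baseline(prior, rewards, z_crit)
    while reject and len(rewards) < n_max:
        rewards.extend(sample_reward() for _ in range(n_init))
        baseline, w, reject = fused_baseline(prior, rewards, z_crit)
    return baseline, w, len(rewards)
```

When the four initial rollouts agree with the prior, the weight stays near 1 and no extra budget is spent. When they contradict it, the weight shifts toward the empirical mean and further rollouts are drawn up to `n_max`.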

Paper Structure (40 sections, 6 theorems, 48 equations, 5 figures)


Key Result

Theorem 3.1

For a single-step policy gradient estimator $\hat{g}(\theta)$ using baseline $b$, the trace of its covariance matrix is strictly bounded by: where:

Figures (5)

  • Figure 1: Performance of $V_{0.5}$ across six diverse mathematical reasoning benchmarks. $V_{0.5}$ outperforms GRPO and DAPO, achieving faster convergence and performance improvements of over 10% in some cases.
  • Figure 2: Demonstration of PPO, GRPO, and our proposed $V_{0.5}$ framework. While PPO requires a synchronously trained value model and GRPO relies on the empirical group mean, $V_{0.5}$ computes an adaptive baseline by fusing a prior from a frozen generalist value model ($V_0$) with sparse empirical rollouts via a dynamic weight $w$ (detailed in \ref{thm:opt_weight}, \ref{eq:sigma}, and \ref{eq:delta}).
  • Figure 3: Evolution of policy gradient norm. $V_{0.5}$ maintains a lower and more stable gradient norm than GRPO. By trading a strictly bounded bias for a reduced baseline MSE, it effectively neutralizes the variance amplification inherent in sparse rollouts.
  • Figure 4: Evolution of policy entropy. While GRPO's high-variance gradients cause rapid entropy decay, $V_{0.5}$ leverages low-noise baseline estimation to sustain higher entropy, ensuring robust exploration in reasoning tasks.
  • Figure 5: Performance of $V_{0.5}$ under extreme sparsity ($1, 2, 4,$ and $8$ rollouts) vs. standard GRPO ($16$ rollouts). To ensure a fair comparison, prompt batch sizes are adjusted to maintain constant per-step computational overhead.
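
The constant-compute comparison in Figure 5 can be sketched as simple arithmetic: if the per-step cost is taken to be the total number of rollouts, then the prompt batch size scales inversely with the group size. The reference values below (64 prompts at 16 rollouts each) are hypothetical and not taken from the paper, and the sketch ignores the value model's inference cost and any extra rollouts allocated on demand.

```python
def prompts_per_step(rollout_budget, group_size):
    """Number of prompts per step that keeps total rollouts fixed."""
    if rollout_budget % group_size != 0:
        raise ValueError("rollout budget must be divisible by group size")
    return rollout_budget // group_size


# Hypothetical reference setting for the 16-rollout GRPO baseline.
GRPO_PROMPTS = 64
GRPO_GROUP = 16
budget = GRPO_PROMPTS * GRPO_GROUP  # total rollouts generated per step

# Prompt batch size for each group size compared in Figure 5.
schedule = {g: prompts_per_step(budget, g) for g in (1, 2, 4, 8, 16)}
```

Under these assumed numbers, a group size of 4 would use 256 prompts per step against 64 for the 16-rollout baseline, so sparser groups trade per-prompt samples for prompt diversity at equal rollout cost.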

Theorems & Definitions (14)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Theorem 3.4
  • Lemma 3.5
  • Theorem 3.6
  • Proof of \ref{thm:mse_bound}
  • Proof of \ref{thm:mse_decomp}
  • Proof of \ref{thm:opt_weight}
  • Proof of \ref{thm:bias_bound}
  • ...and 4 more