Shrinking the Variance: Shrinkage Baselines for Reinforcement Learning with Verifiable Rewards

Guanning Zeng, Zhaoyi Zhou, Daman Arora, Andrea Zanette

TL;DR

This work tackles high variance in policy-gradient updates for Reinforcement Learning with Verifiable Rewards (RLVR) by introducing a James–Stein shrinkage baseline that combines per-prompt and across-prompt reward means. The baseline is constructed as a two-level leave-one-out estimator to preserve unbiasedness while adaptively shrinking toward the batch mean, with an optimal shrinkage coefficient computed from batch statistics. Theoretical analysis shows lower mean-squared error for the shrinkage baseline under the bias-variance tradeoff, yielding provable variance reduction in the gradient estimator, and it requires no additional hyperparameters. Empirically, the James–Stein baseline delivers consistent variance reductions and training stability improvements across mathematical reasoning, logic puzzles, and standard RLVR benchmarks, outperforming RLOO, BLOO, and other baselines under varying rollout budgets and model scales. Overall, the approach offers a simple, drop-in variance reduction technique for critic-free RLVR that scales across tasks and architectures, with strong practical impact for post-training reasoning models.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for post-training large reasoning models (LRMs) using policy-gradient methods such as GRPO. To stabilize training, these methods typically center trajectory rewards by subtracting the empirical mean for each prompt. Statistically, this centering acts as a control variate (or baseline), reducing the variance of the policy-gradient estimator. Typically, the mean reward is estimated using per-prompt empirical averages for each prompt in a batch. Drawing inspiration from Stein's paradox, we propose using shrinkage estimators that combine per-prompt and across-prompt means to improve the overall per-prompt mean estimation accuracy -- particularly in the low-generation regime typical of RLVR. Theoretically, we construct a shrinkage-based baseline that provably yields lower-variance policy-gradient estimators across algorithms. Our proposed baseline serves as a drop-in replacement for existing per-prompt mean baselines, requiring no additional hyper-parameters or computation. Empirically, shrinkage baselines consistently outperform standard empirical-mean baselines, leading to lower-variance gradient updates and improved training stability.
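To make the centering step concrete, here is a minimal sketch in plain Python of two per-prompt baselines: subtracting the empirical mean of all responses to a prompt, and the leave-one-out variant used by RLOO, where each response is compared against the mean of the other responses to the same prompt. The function names are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def mean_centered_advantages(rewards):
    """Subtract the per-prompt empirical mean from each reward.

    `rewards` holds the rewards of all sampled responses to ONE prompt.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def leave_one_out_advantages(rewards):
    """RLOO-style centering: the baseline for response j is the mean
    reward of the other m - 1 responses to the same prompt."""
    m = len(rewards)
    if m < 2:
        raise ValueError("leave-one-out needs at least 2 responses per prompt")
    total = sum(rewards)
    return [r - (total - r) / (m - 1) for r in rewards]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # binary verifiable rewards for one prompt
    print(mean_centered_advantages(rewards))
    print(leave_one_out_advantages(rewards))
```

With only a handful of responses per prompt, both baselines rest on very few samples, which is the low-generation regime the abstract refers to.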

Paper Structure

This paper contains 28 sections, 3 theorems, 43 equations, 12 figures, 12 tables.

Key Result

Proposition 1

Suppose $b_i^j$ is independent of $y_i^j$ for all $1\leq i\leq n$ and $1\leq j\leq m$. Then $g(\mathbf{x}, \mathbf{Y};\theta)$ is an unbiased estimator of the policy gradient.
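The role of the independence condition can be checked exactly on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's setup: it assumes a one-parameter Bernoulli policy with $\pi_\theta(y=1) = \sigma(\theta)$ and reward $r(y) = y$, so the true gradient of the expected reward is $p(1-p)$ with $p = \sigma(\theta)$. It enumerates every outcome of $m$ samples and compares a leave-one-out baseline (independent of the sample it is applied to) with a baseline that includes the sample itself. All function names are hypothetical.

```python
import itertools
import math


def expected_gradient(theta, baseline_fn, m=3):
    """Exact expectation of (1/m) * sum_j (r_j - b_j) * score_j.

    Toy assumption: y ~ Bernoulli(p) with p = sigmoid(theta), reward r(y) = y,
    and score d/dtheta log pi(y) = y - p.
    """
    p = 1.0 / (1.0 + math.exp(-theta))
    total = 0.0
    for ys in itertools.product([0, 1], repeat=m):
        prob = math.prod(p if y == 1 else 1.0 - p for y in ys)
        estimate = sum(
            (ys[j] - baseline_fn(ys, j)) * (ys[j] - p) for j in range(m)
        ) / m
        total += prob * estimate
    return total


def loo_baseline(ys, j):
    """Mean of the other samples: independent of ys[j]."""
    others = ys[:j] + ys[j + 1:]
    return sum(others) / len(others)


def self_inclusive_baseline(ys, j):
    """Mean of all samples, including ys[j]: NOT independent of ys[j]."""
    return sum(ys) / len(ys)


if __name__ == "__main__":
    theta = 0.3
    p = 1.0 / (1.0 + math.exp(-theta))
    print("true gradient      :", p * (1 - p))
    print("leave-one-out      :", expected_gradient(theta, loo_baseline))
    print("self-inclusive mean:", expected_gradient(theta, self_inclusive_baseline))
```

In this toy case the leave-one-out baseline recovers the true gradient, while the self-inclusive mean shrinks it by a factor of $(m-1)/m$, which is why baselines of this kind are built from leave-one-out quantities.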

Figures (12)

  • Figure 1: Overview of using the James-Stein Shrinkage Baseline in RLVR of large reasoning models. Consider a step in RLVR with $\textbf{n}$ question prompts, each generating $\textbf{m}$ responses. For every response, our method computes a leave-one-out prompt-level reward mean $\widehat{\mu}_i^{-j}$ and a leave-one-out batch-level reward mean $\widehat{\bar{\mu}}_{-i}$. It then estimates an optimal shrinkage coefficient $\widehat{\lambda}_i$ from reward-sample statistics. These components are combined to produce a variance-reduced baseline $b_i^j$. By lowering the variance in policy-gradient estimation, the JS baseline enables more effective reinforcement learning for large reasoning models.
  • Figure 2: Comparison of the JS shrinkage baseline with the RLOO baseline (Ahmadian et al., 2024) on Qwen2.5 math models trained on the DAPO17k and MATH12k datasets. The JS baseline significantly outperforms RLOO across different models and benchmarks.
  • Figure 3: Comparison of training reward and test accuracy between the JS baseline and RLOO on the Qwen3-4B-Base model trained on the DAPO17k dataset.
  • Figure 4: Comparison of average test-set scores between the JS baseline and RLOO on logic puzzle reasoning tasks. The JS baseline outperforms RLOO across various tasks and models. The number after each Knights-and-Knaves (KnK) dataset denotes the number of people in the puzzle; a larger number indicates higher difficulty.
  • Figure 5: Comparison of training reward (running average) and test accuracy between the JS baseline and RLOO on the Qwen2.5-1.5B-Instruct model trained on the KnK dataset.
  • ...and 7 more figures
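
The pipeline described in the Figure 1 caption can be sketched in plain Python as follows. This is a hedged illustration, not the authors' implementation: the caption does not give the formula for $\widehat{\lambda}_i$, so the coefficient below is an assumed textbook plug-in (estimated noise variance of the prompt-level mean divided by noise plus estimated between-prompt spread), computed only from prompts other than $i$ so that the baseline for response $j$ of prompt $i$ never uses that response's own reward. The function name and the exact form of the coefficient are assumptions.

```python
from statistics import mean, variance


def js_style_baselines(rewards):
    """Shrinkage-style baselines for an n x m grid of rewards.

    rewards[i][j] is the reward of response j to prompt i.
    Returns b with b[i][j] = (1 - lam_i) * prompt_loo + lam_i * batch_loo.
    ASSUMPTION: lam_i is a generic plug-in estimate, not the paper's formula.
    """
    rows = [[float(x) for x in r] for r in rewards]
    n, m = len(rows), len(rows[0])
    if n < 3 or m < 2:
        raise ValueError("need n >= 3 prompts and m >= 2 responses per prompt")

    prompt_means = [mean(r) for r in rows]
    prompt_vars = [variance(r) for r in rows]  # unbiased within-prompt variance

    baselines = []
    for i in range(n):
        # Statistics from all prompts except prompt i.
        other_means = prompt_means[:i] + prompt_means[i + 1:]
        other_vars = prompt_vars[:i] + prompt_vars[i + 1:]
        batch_loo = mean(other_means)  # leave-one-out batch-level mean

        # Noise of a prompt-level mean built from m - 1 responses.
        noise = mean(other_vars) / (m - 1)
        # Spread of the true prompt means: variance of the observed means
        # minus their sampling noise, clipped at zero.
        spread = max(0.0, variance(other_means) - mean(other_vars) / m)
        lam = 0.0 if noise == 0.0 else noise / (noise + spread)

        row = []
        for j in range(m):
            # Leave-one-out prompt-level mean: excludes response j itself.
            prompt_loo = (sum(rows[i]) - rows[i][j]) / (m - 1)
            row.append((1.0 - lam) * prompt_loo + lam * batch_loo)
        baselines.append(row)
    return baselines


if __name__ == "__main__":
    batch = [[1, 0, 0, 1], [0, 0, 0, 1], [1, 1, 1, 0]]
    for row in js_style_baselines(batch):
        print([round(b, 4) for b in row])
```

When the other prompts' means are tightly clustered relative to their sampling noise, the assumed coefficient moves toward 1 and the baseline leans on the batch-level mean; when prompts differ widely in difficulty, it moves toward 0 and the baseline falls back to the leave-one-out prompt mean.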

Theorems & Definitions (3)

  • Proposition 1: Unbiasedness
  • Proposition 2
  • Theorem 1