Shrinking the Variance: Shrinkage Baselines for Reinforcement Learning with Verifiable Rewards
Guanning Zeng, Zhaoyi Zhou, Daman Arora, Andrea Zanette
TL;DR
This work tackles high variance in policy-gradient updates for Reinforcement Learning with Verifiable Rewards (RLVR) by introducing a James–Stein shrinkage baseline that combines per-prompt and across-prompt reward means. The baseline is constructed as a two-level leave-one-out estimator, which keeps the policy-gradient estimator unbiased while adaptively shrinking each per-prompt mean toward the batch mean, with the optimal shrinkage coefficient computed from batch statistics. Theoretical analysis shows that the shrinkage baseline attains lower mean-squared error through a favorable bias-variance tradeoff, which yields provable variance reduction in the gradient estimator, and it requires no additional hyperparameters. Empirically, the James–Stein baseline delivers consistent variance reductions and training stability improvements across mathematical reasoning, logic puzzles, and standard RLVR benchmarks, outperforming RLOO, BLOO, and other baselines under varying rollout budgets and model scales. Overall, the approach offers a simple, drop-in variance reduction technique for critic-free RLVR that scales across tasks and architectures, with strong practical impact for post-training reasoning models.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for post-training large reasoning models (LRMs) using policy-gradient methods such as GRPO. To stabilize training, these methods typically center trajectory rewards by subtracting a mean reward that is estimated separately for each prompt in a batch, as the empirical average over that prompt's rollouts. Statistically, this centering acts as a control variate (or baseline), reducing the variance of the policy-gradient estimator. Drawing inspiration from Stein's paradox, we propose using shrinkage estimators that combine per-prompt and across-prompt means to improve the overall accuracy of per-prompt mean estimation, particularly in the low-generation regime typical of RLVR. Theoretically, we construct a shrinkage-based baseline that provably yields lower-variance policy-gradient estimators across algorithms. Our proposed baseline serves as a drop-in replacement for existing per-prompt mean baselines, requiring no additional hyperparameters or computation. Empirically, shrinkage baselines consistently outperform standard empirical-mean baselines, leading to lower-variance gradient updates and improved training stability.
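
To make the idea concrete, the following is a minimal NumPy sketch of a shrinkage baseline in the spirit described above. It is an illustration under stated assumptions, not the authors' implementation: the function name `shrinkage_baseline`, the layout of the reward array, and in particular the formula for the shrinkage coefficient (a generic James–Stein-style ratio of sampling noise to total spread) are choices made here for illustration, since the text above does not state the exact coefficient. The sketch builds the baseline for rollout (i, j) only from the other rollouts of prompt i and from the other prompts in the batch, so the baseline never depends on the reward it is subtracted from.

```python
import numpy as np


def shrinkage_baseline(rewards):
    """Hypothetical shrinkage baseline for a batch of rollout rewards.

    rewards: array of shape (n_prompts, m_rollouts), with n_prompts >= 3
    and m_rollouts >= 2. Returns an array of the same shape holding the
    baseline for every rollout.
    """
    r = np.asarray(rewards, dtype=float)
    n, m = r.shape
    if n < 3 or m < 2:
        raise ValueError("need at least 3 prompts and 2 rollouts per prompt")

    # Level 1: leave-one-out per-prompt mean, excluding rollout (i, j) itself.
    loo_prompt = (r.sum(axis=1, keepdims=True) - r) / (m - 1)

    prompt_means = r.mean(axis=1)
    prompt_vars = r.var(axis=1, ddof=1)

    baseline = np.empty_like(r)
    for i in range(n):
        # Level 2: every batch statistic below excludes prompt i entirely.
        others = np.delete(np.arange(n), i)
        batch_mean = prompt_means[others].mean()

        # Assumed sampling noise of a mean built from m - 1 rollouts.
        noise = prompt_vars[others].mean() / (m - 1)
        # Assumed spread of the true prompt means: variance of the observed
        # means minus their own sampling noise, clipped at zero.
        spread = max(
            prompt_means[others].var(ddof=1) - prompt_vars[others].mean() / m,
            0.0,
        )
        # Assumed coefficient: shrink more when noise dominates the spread.
        total = noise + spread
        lam = noise / total if total > 0 else 0.0

        baseline[i] = (1.0 - lam) * loo_prompt[i] + lam * batch_mean
    return baseline


# Example: 4 prompts, 4 binary verifiable rewards each.
example_rewards = np.array(
    [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 0], [0, 0, 0, 1]], dtype=float
)
advantages = example_rewards - shrinkage_baseline(example_rewards)
```

When the per-prompt success rates look alike relative to the sampling noise, the coefficient approaches 1 and the baseline falls back to the across-prompt mean; when prompts clearly differ, it approaches 0 and the baseline reduces to the usual leave-one-out per-prompt mean.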
