Discounted Beta--Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards

Haechan Kim; Soohyun Ryu; Gyouk Chu; Doohyuk Jang; Eunho Yang

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective post-training paradigm for improving the reasoning capabilities of large language models. However, existing group-based RLVR methods often suffer from severe sample inefficiency. This inefficiency stems from reliance on point estimation of rewards from a small number of rollouts, leading to high estimation variance, variance collapse, and ineffective utilization of generated responses. In this work, we reformulate RLVR from a statistical estimation perspective by modeling rewards as samples drawn from a policy-induced distribution and casting advantage computation as the problem of estimating the reward distribution from finite data. Building on this view, we propose Discounted Beta--Bernoulli (DBB) reward estimation, which leverages historical reward statistics for the non-stationary distribution. Although biased, the resulting estimator exhibits reduced and stable variance, theoretically avoids estimated variance collapse, and achieves lower mean squared error than standard point estimation. Extensive experiments across six in-distribution and three out-of-distribution reasoning benchmarks demonstrate that GRPO with DBB consistently outperforms naive GRPO, achieving average Acc@8 improvements of 3.22/2.42 points in-distribution and 12.49/6.92 points out-of-distribution on the 1.7B and 8B models, respectively, without additional computational cost or memory usage.
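
The contrast the abstract draws between point estimation and discounted estimation can be sketched in a few lines of Python. This is only an illustration: the update rule (multiply the old pseudo-counts by the discount, then add the new successes and failures), the Beta(1, 1) prior, the default discount of 0.4, and the names `DiscountedBetaBernoulli` and `point_estimate` are all assumptions, not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass
class DiscountedBetaBernoulli:
    """Hypothetical per-prompt tracker of discounted Beta pseudo-counts."""
    alpha: float = 1.0     # assumed prior pseudo-count of successes
    beta: float = 1.0      # assumed prior pseudo-count of failures
    discount: float = 0.4  # assumed discount factor (lambda)

    def update(self, rewards):
        """Decay old counts, then add the binary rewards of the new rollouts."""
        successes = sum(rewards)
        failures = len(rewards) - successes
        self.alpha = self.discount * self.alpha + successes
        self.beta = self.discount * self.beta + failures

    @property
    def mean(self):
        """Posterior-mean estimate of the success probability."""
        return self.alpha / (self.alpha + self.beta)

    @property
    def bernoulli_variance(self):
        """Estimated reward variance p(1 - p) at the posterior mean."""
        p = self.mean
        return p * (1.0 - p)


def point_estimate(rewards):
    """Group mean and plug-in variance p(1 - p) from the current rollouts only."""
    p = sum(rewards) / len(rewards)
    return p, p * (1.0 - p)


if __name__ == "__main__":
    est = DiscountedBetaBernoulli()
    group = [0, 0, 0, 0]  # every rollout fails
    est.update(group)
    print(point_estimate(group))               # plug-in variance is exactly 0
    print(est.mean, est.bernoulli_variance)    # shrunk mean, positive variance
```

Under these assumptions, a group in which every rollout fails gives a point estimate of 0 with zero plug-in variance, while the discounted estimate is about 0.083 and keeps a positive estimated variance because the decayed pseudo-counts stay above zero.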

Paper Structure (33 sections, 38 equations, 3 figures, 5 tables, 1 algorithm)

Introduction
Preliminaries
Reinforcement Learning with Verifiable Rewards
Group Relative Policy Optimization (GRPO)
Method
Reward Estimation as Distributional Inference
Discounted Beta--Bernoulli Reward Estimation
Mean Squared Error of the DBB estimator
Experiments
Experimental Settings
Models & Datasets.
Baselines.
Training & Evaluation Setups.
Main Results
Analysis & Discussion
...and 18 more sections

Figures (3)

Figure 1: Comparison between point estimation and DBB estimation. By trading a small bias for substantial variance reduction via shrinkage, DBB estimation achieves lower mean squared error. Compared to naive GRPO using point estimation, GRPO with DBB estimation consistently demonstrates superior performance across all benchmarks and both model scales.
Figure 2: Training dynamics of naive GRPO and GRPO with the DBB estimator (GRPO-DBB) on Qwen3-1.7B-Base (top) and Qwen3-8B-Base (bottom). GRPO-DBB achieves higher validation Acc@8 and training rewards, while maintaining longer responses with controlled entropy compared to GRPO, indicating more stable exploration during training.
Figure 3: MSE as a function of the discount factor $\lambda$ and the number of rollouts $N$. The DBB estimator yields lower MSE than the point estimator across a wide range of $\lambda$, and it achieves lower MSE for rollout budgets up to $N=16$ when $\lambda=0.4$.
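
The bias-variance trade-off mentioned in the captions of Figures 1 and 3 can be illustrated with the textbook stationary Beta--Bernoulli case. This is a simplified sketch, not the paper's derivation: it assumes a fixed success probability $p$, $N$ independent rollouts with $S$ successes, and fixed pseudo-counts $a, b > 0$ standing in for the discounted historical statistics. The point estimator $\hat{p} = S/N$ and the shrinkage estimator $\tilde{p} = (a + S)/(a + b + N)$ then satisfy

$$
\mathrm{MSE}(\hat{p}) = \frac{p(1-p)}{N},
\qquad
\mathrm{MSE}(\tilde{p}) = \underbrace{\frac{N\,p(1-p)}{(a+b+N)^2}}_{\text{variance}} + \underbrace{\left(\frac{a - (a+b)\,p}{a+b+N}\right)^{2}}_{\text{squared bias}} .
$$

For example, with $p = 0.5$, $N = 8$, and $a = b = 1$, the point estimator has MSE $0.25/8 \approx 0.031$, while the shrinkage estimator has zero bias and MSE $2/100 = 0.02$. The gain shrinks, and can reverse, as the prior mean $a/(a+b)$ moves away from $p$.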

Theorems & Definitions (2)

Definition 1: Beta--Bernoulli Reward Model
Definition 2: Discounted Beta--Bernoulli Reward Model
