Probabilistic Uncertain Reward Model

Wangtao Sun; Xiang Cheng; Xing Yu; Haotian Xu; Zhao Yang; Shizhu He; Jun Zhao; Kang Liu

TL;DR

This work addresses overconfidence and reward hacking in RLHF by generalizing the Bradley-Terry reward model to a probabilistic framework, where each prompt-response pair yields a Gaussian reward $r\sim\mathcal{N}(\mu,\sigma)$. PURM uses a two-head architecture to output $(\mu,\log\sigma)$ and derives a maximum likelihood objective for pairwise preferences, with uncertainty quantified via Bhattacharyya Coefficients between reward distributions. The uncertainty estimate $u(x,y)$ is used to penalize uncertain rewards during RLHF, effectively steering learning away from uncertain regions. Empirical results show PURM achieves competitive reward accuracy, sound uncertainty estimates (capturing aleatoric and epistemic uncertainty), and improved resistance to reward hacking, yielding longer stable optimization and higher final win rates. The approach offers a practical path to more robust RLHF systems, though its applicability to other reward-modeling paradigms remains an area for future work.
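The overlap-based uncertainty and the reward penalty described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes $\sigma$ is a standard deviation, uses the standard closed form of the Bhattacharyya coefficient for two univariate Gaussians, and the helper names (`bhattacharyya_coefficient`, `mean_overlap_uncertainty`, `penalized_reward`), the averaging over a reference set, and the `weight` parameter are all hypothetical choices made for exposition.

```python
import math


def bhattacharyya_coefficient(mu1, sigma1, mu2, sigma2):
    """Closed-form overlap of N(mu1, sigma1^2) and N(mu2, sigma2^2), in (0, 1]."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    scale = math.sqrt(2.0 * sigma1 * sigma2 / var_sum)
    return scale * math.exp(-((mu1 - mu2) ** 2) / (4.0 * var_sum))


def mean_overlap_uncertainty(target, references):
    """Assumed uncertainty score: average overlap of one reward distribution
    with a set of reference reward distributions, each given as (mu, sigma)."""
    mu, sigma = target
    overlaps = [bhattacharyya_coefficient(mu, sigma, m, s) for m, s in references]
    return sum(overlaps) / len(overlaps)


def penalized_reward(mu, uncertainty, weight=1.0):
    """Assumed penalty form: subtract a weighted uncertainty from the mean reward."""
    return mu - weight * uncertainty


# Hypothetical reference distributions for three other responses.
references = [(-2.0, 0.5), (0.0, 0.5), (2.0, 0.5)]
confident = mean_overlap_uncertainty((0.0, 0.1), references)
uncertain = mean_overlap_uncertainty((0.0, 3.0), references)
print(confident < uncertain)  # a wide distribution overlaps more with its neighbours
```

Under these assumptions, a reward with a broad distribution overlaps heavily with the rewards of other responses, so it receives a larger penalty than a sharply peaked one with the same mean.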

Abstract

Reinforcement learning from human feedback (RLHF) is a critical technique for training large language models. However, conventional reward models based on the Bradley-Terry model (BTRM) often suffer from overconfidence when faced with inconsistent labels or out-of-distribution samples. This leads to reward hacking, where the policy model blindly optimizes for proxy rewards while its true performance degrades. This paper proposes the Probabilistic Uncertain Reward Model (PURM), which generalizes the Bradley-Terry model to learn the reward distributions that emerge from preference data. We theoretically derive the loss function of PURM and introduce a novel method that uses the overlap between distributions to quantify uncertainty. Empirical results show that PURM outperforms existing methods, with more accurate reward estimates and sound uncertainty estimates, and that in RLHF it sustains effective learning for more optimization steps and obtains a higher maximum win rate. The data and code of this paper are released at https://anonymous.4open.science/r/Probabilistic-Uncertain-Reward-Model/
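As a hedged illustration of how such a generalization can be written down (an assumption for exposition, not necessarily the exact objective derived in the paper), suppose the rewards of the preferred response $y_w$ and the rejected response $y_l$ are independent Gaussians, with each $\sigma$ read as a standard deviation, and take the expectation of the Bradley-Terry sigmoid $s(t)=1/(1+e^{-t})$ over both. The subscripts $w$, $l$ and the symbol $s$ are notation introduced here.

$$
\begin{aligned}
r_w \sim \mathcal{N}(\mu_w,\sigma_w^2),\quad r_l \sim \mathcal{N}(\mu_l,\sigma_l^2)
\;&\Longrightarrow\; r_w - r_l \sim \mathcal{N}\left(\mu_w-\mu_l,\ \sigma_w^2+\sigma_l^2\right),\\
P(y_w \succ y_l \mid x) \;&=\; \mathbb{E}_{z\sim\mathcal{N}(0,1)}\left[\, s\left(\mu_w-\mu_l + z\sqrt{\sigma_w^2+\sigma_l^2}\right)\right].
\end{aligned}
$$

A maximum likelihood loss is then the negative log of this probability, averaged over preference pairs. As $\sigma_w,\sigma_l \to 0$ the expectation collapses to $s(\mu_w-\mu_l)$, which is the ordinary Bradley-Terry preference probability, so the deterministic model is recovered as a special case.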
