Probabilistic Uncertain Reward Model

Wangtao Sun, Xiang Cheng, Xing Yu, Haotian Xu, Zhao Yang, Shizhu He, Jun Zhao, Kang Liu

TL;DR

This work addresses overconfidence and reward hacking in RLHF by generalizing the Bradley-Terry reward model to a probabilistic framework in which each prompt-response pair yields a Gaussian reward $r\sim\mathcal{N}(\mu,\sigma)$. PURM uses a two-head architecture to output $(\mu,\log\sigma)$ and derives a maximum likelihood objective for pairwise preferences. Uncertainty is quantified via Bhattacharyya coefficients between reward distributions, and the resulting estimate $u(x,y)$ is used to penalize uncertain rewards during RLHF, steering learning away from uncertain regions. Empirical results show that PURM achieves competitive reward accuracy, sound uncertainty estimates (capturing both aleatoric and epistemic uncertainty), and improved resistance to reward hacking, yielding a longer period of stable optimization and higher final win rates. The approach offers a practical path to more robust RLHF systems, though its applicability to other reward-modeling paradigms remains an area for future work.
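To make the overlap-based uncertainty concrete, the following is a minimal sketch of the closed-form Bhattacharyya coefficient between two univariate Gaussians, followed by a penalized reward. It rests on several assumptions that this summary does not confirm: $\sigma$ is treated as a standard deviation, `mean_overlap` (averaging the coefficient over a reference set) is a hypothetical stand-in for however PURM aggregates overlaps into $u(x,y)$, and the subtractive form `mu - lam * u` is only one plausible way to apply a penalty weight $\lambda$.

```python
import math

def bhattacharyya_coefficient(mu1, sigma1, mu2, sigma2):
    """Overlap of N(mu1, sigma1^2) and N(mu2, sigma2^2); lies in (0, 1].

    Standard closed form for two univariate Gaussians, with sigma taken
    to be a standard deviation (an assumption about the notation).
    """
    var_sum = sigma1 ** 2 + sigma2 ** 2
    scale = math.sqrt(2.0 * sigma1 * sigma2 / var_sum)
    return scale * math.exp(-((mu1 - mu2) ** 2) / (4.0 * var_sum))

def mean_overlap(mu, sigma, reference):
    """Hypothetical aggregation: average overlap with a set of (mu, sigma) pairs."""
    coeffs = [bhattacharyya_coefficient(mu, sigma, m, s) for m, s in reference]
    return sum(coeffs) / len(coeffs)

def penalized_reward(mu, u, lam=1.0):
    """Illustrative penalty: subtract lam times the uncertainty from the mean reward."""
    return mu - lam * u

# Identical distributions overlap completely; distant ones barely overlap.
same = bhattacharyya_coefficient(0.0, 1.0, 0.0, 1.0)
far = bhattacharyya_coefficient(0.0, 1.0, 10.0, 1.0)
u = mean_overlap(0.0, 1.0, [(0.0, 1.0), (10.0, 1.0)])
print(same, far, penalized_reward(2.0, u, lam=0.5))
```

Under this sketch, a wide reward distribution overlaps with many others and so receives a large penalty, while a sharply peaked, well-separated one is left almost unchanged.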

Abstract

Reinforcement learning from human feedback (RLHF) is a critical technique for training large language models. However, conventional reward models based on the Bradley-Terry model (BTRM) often suffer from overconfidence when faced with inconsistent labels or out-of-distribution samples. This leads to reward hacking, where the policy model blindly optimizes for proxy rewards while true performance degrades. This paper proposes the Probabilistic Uncertain Reward Model (PURM), which generalizes the Bradley-Terry model to learn the reward distributions that emerge from the preference data. We theoretically derive the loss function of PURM and introduce a novel method that uses the overlap between distributions to quantify uncertainty. Empirical results show that PURM outperforms existing methods with more accurate reward and sound uncertainty estimations, sustains effective learning for more optimization steps, and obtains a higher maximum win rate in RLHF. The data and code of this paper are released at https://anonymous.4open.science/r/Probabilistic-Uncertain-Reward-Model/
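As a rough illustration of how a Bradley-Terry likelihood can be generalized to reward distributions, the sketch below estimates the preference probability as the expected sigmoid of the reward difference when both rewards are drawn from Gaussians, using Monte Carlo sampling. This is an assumption made for illustration, not the loss derived in the paper: the function names, the sampling estimator, and the sample count are all hypothetical. When both standard deviations are zero, the estimate reduces to the ordinary Bradley-Terry probability.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_probability(mu_w, sigma_w, mu_l, sigma_l, n_samples=20000, seed=0):
    """Monte Carlo estimate of E[sigmoid(r_w - r_l)] with
    r_w ~ N(mu_w, sigma_w^2) and r_l ~ N(mu_l, sigma_l^2) drawn independently.
    (Assumed form of the generalized Bradley-Terry probability.)
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        r_w = rng.gauss(mu_w, sigma_w)
        r_l = rng.gauss(mu_l, sigma_l)
        total += sigmoid(r_w - r_l)
    return total / n_samples

def pairwise_nll(mu_w, sigma_w, mu_l, sigma_l):
    """Negative log-likelihood of observing 'w preferred over l'."""
    return -math.log(preference_probability(mu_w, sigma_w, mu_l, sigma_l))

# With zero variance this is exactly the Bradley-Terry probability sigmoid(mu_w - mu_l).
print(preference_probability(1.0, 0.0, 0.0, 0.0), sigmoid(1.0))
# Adding variance pulls the probability toward 0.5, i.e. a less confident preference.
print(preference_probability(1.0, 2.0, 0.0, 2.0))
```

In a trainable model, the samples would typically be drawn with a reparameterized form such as $r = \mu + \sigma\epsilon$ so that gradients reach both heads; the plain-Python version here only illustrates the quantity being estimated.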

Paper Structure

This paper contains 22 sections, 22 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: The architectures and performance curves in RLHF of BTRM and PURM.
  • Figure 2: Models' estimations of aleatoric uncertainty. PURM successfully recognizes the noise underlying the training data and generates corresponding uncertainties, while BTE and BRME struggle to model such a pattern.
  • Figure 3: Models' estimations of epistemic uncertainty. Compared to BTE and BRME, PURM demonstrates a significantly distinct behavior: it shows lower uncertainty on in-domain data while exhibiting higher uncertainty on OOD data.
  • Figure 4: The valid rewards of PURM and other RMs during RLHF. PURM significantly delays the onset of performance degradation and achieves the highest win rate against the reference policy model GPT-4o.
  • Figure 5: The ablation results of the uncertainty choice and penalty weight $\lambda$ of PURM.
  • ...and 7 more figures