Bayesian Reward Models for LLM Alignment

Adam X. Yang, Maxime Robeyns, Thomas Coste, Zhengyan Shi, Jun Wang, Haitham Bou-Ammar, Laurence Aitchison

TL;DR

This work tackles reward overoptimization and hacking in LLM alignment by introducing Bayesian reward models based on Laplace-LoRA, which quantify epistemic uncertainty in reward predictions. By penalizing high-uncertainty candidates and combining this penalty with reward ensembles, the approach mitigates overoptimization in BoN sampling and improves gold-standard reward scores, particularly under distribution shift. The results show substantial gains in BoN and RLHF settings, highlighting the value of uncertainty-aware regularization of proxy rewards. Overall, this Bayesian, parameter-efficient framework enhances robustness and safety in LLM alignment.

Abstract

To ensure that large language model (LLM) responses are helpful and non-toxic, a reward model trained on human preference data is usually used. LLM responses with high rewards are then selected through best-of-$n$ (BoN) sampling, or the LLM is further optimized to produce responses with high rewards through reinforcement learning from human feedback (RLHF). However, these processes are susceptible to reward overoptimization or "hacking", where responses receive high rewards due to imperfections in the reward model rather than true preference, particularly as prompts or responses deviate from the training data. To address these challenges, we propose to train a Bayesian reward model, which signals higher uncertainty further from the training data distribution. We trained Bayesian reward models using Laplace approximation on LoRA weights, and found that the resulting uncertainty estimates can effectively mitigate reward overoptimization in BoN sampling.
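
As a rough illustration of how an uncertainty estimate could enter BoN selection, the sketch below subtracts a penalty from each candidate's proxy reward before taking the maximum. The two penalty forms (standard deviation and variance), the coefficient name `k`, and the helper names `penalized_score` and `best_of_n` are assumptions made for this example and are not taken from the text above. The reward means and variances are made-up numbers standing in for the output of a Bayesian reward model.

```python
import math


def penalized_score(mean, variance, k, penalty="std"):
    """Proxy reward minus an uncertainty penalty (hypothetical forms)."""
    if penalty == "std":
        # Assumed form: penalize by k times the predictive standard deviation.
        return mean - k * math.sqrt(variance)
    if penalty == "var":
        # Assumed form: penalize by k times the predictive variance.
        return mean - k * variance
    raise ValueError(f"unknown penalty: {penalty!r}")


def best_of_n(candidates, k=1.0, penalty="std"):
    """Return the response with the highest penalized proxy reward.

    candidates: list of (response, reward_mean, reward_variance) tuples.
    """
    best = max(candidates, key=lambda c: penalized_score(c[1], c[2], k, penalty))
    return best[0]


# Made-up candidates: A has the highest mean reward but is very uncertain.
candidates = [
    ("response A", 2.0, 4.0),
    ("response B", 1.5, 0.25),
    ("response C", 0.5, 0.01),
]

print(best_of_n(candidates, k=0.0))  # no penalty: plain BoN on the mean
print(best_of_n(candidates, k=1.0))  # std penalty shifts the choice
```

With `k = 0` the selection reduces to ordinary BoN on the proxy reward. Increasing `k` favors candidates whose reward the model is more certain about, which is the behavior the abstract attributes to the uncertainty estimates.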

Paper Structure

This paper contains 18 sections, 20 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Illustrations of reward overoptimization in LLM alignment.
  • Figure 2: Comparison of proxy and gold reward scores (normalized) of single reward model (MAP) and Laplace-LoRA reward model (LA) in BoN sampling, across different uncertainty penalties and a range of $k$. Left column: compares the proxy reward model's evaluation. Right column: compares the gold reward model's evaluation.
  • Figure 3: Comparison of proxy and gold reward scores (normalized) of single reward model (MAP), reward model ensemble (Ens), and Laplace-LoRA reward model ensemble (LA Ens) in BoN sampling, across different uncertainty penalties and a range of $k$.
  • Figure 4: Comparison of proxy and gold reward scores (normalized) in BoN sampling, across different uncertainty penalties and a range of $k$. Left column: compares the proxy reward model's evaluation. Right column: compares the gold reward model's evaluation.
  • Figure 5: Comparison of proxy and gold reward scores (normalized) in BoN sampling, across different uncertainty penalties and a range of $k$. Left column: compares the proxy reward model's evaluation. Right column: compares the gold reward model's evaluation.