Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking

Paria Rashidinejad, Yuandong Tian

TL;DR

This work tackles reward hacking in offline preference optimization (PO) by formalizing a two-source distribution-shift setting and identifying two hacking types: Type I (poorly covered subpar choices appear favorable) and Type II (decent choices appear unfavorable). It proposes POWER, a Weighted Entropy Robust Rewards framework, to guard against Type I hacking by emphasizing well-supported actions, and POWER-DL, which adds Dynamic Labels to curb Type II hacking by shrinking gradients for untrustworthy samples. POWER comes with finite-sample guarantees, and POWER-DL shows strong empirical performance, improving over DPO by up to 13.0 points on AlpacaEval 2.0 and 11.5 points on Arena-Hard while maintaining or improving performance on downstream tasks such as GSM8K. The combination of theory and practice suggests POWER-DL as a robust, principled path toward more reliable alignment in LLMs, with potential extensions to broader robustness challenges and online settings.

Abstract

Aligning AI systems with human preferences typically suffers from the infamous reward hacking problem, where optimization of an imperfect reward model leads to undesired behaviors. In this paper, we investigate reward hacking in offline preference optimization, which aims to improve an initial model using a preference dataset. We identify two types of reward hacking stemming from statistical fluctuations in the dataset: Type I Reward Hacking due to subpar choices appearing more favorable, and Type II Reward Hacking due to decent choices appearing less favorable. We prove that many (mainstream or theoretical) preference optimization methods suffer from both types of reward hacking. To mitigate Type I Reward Hacking, we propose POWER, a new preference optimization method that combines Guiasu's weighted entropy with a robust reward maximization objective. POWER enjoys finite-sample guarantees under general function approximation, competing with the best covered policy in the data. To mitigate Type II Reward Hacking, we analyze the learning dynamics of preference optimization and develop a novel technique that dynamically updates preference labels toward certain "stationary labels", resulting in diminishing gradients for untrustworthy samples. Empirically, POWER with dynamic labels (POWER-DL) consistently outperforms state-of-the-art methods on alignment benchmarks, achieving improvements of up to 13.0 points on AlpacaEval 2.0 and 11.5 points on Arena-Hard over DPO, while also improving or maintaining performance on downstream tasks such as mathematical reasoning. Strong theoretical guarantees and empirical results demonstrate the promise of POWER-DL in mitigating reward hacking.
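To make the "diminishing gradients" idea concrete, here is a minimal sketch using a generic logistic preference loss with a soft label. It is an illustration under stated assumptions, not the paper's exact update rule: the names `soft_label_grad` and `update_label` and the step size `gamma` are hypothetical. For a soft label $l$ and a margin $m$ (the reward difference between the chosen and rejected responses), the loss $-l \log \sigma(m) - (1-l) \log \sigma(-m)$ has gradient $\sigma(m) - l$ with respect to $m$, so the gradient shrinks as the label moves toward $\sigma(m)$.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def soft_label_grad(margin: float, label: float) -> float:
    """Gradient of the soft-label logistic loss with respect to the margin."""
    return sigmoid(margin) - label


def update_label(label: float, margin: float, gamma: float = 0.5) -> float:
    """Hypothetical rule: move the label a fraction gamma toward sigmoid(margin)."""
    return (1.0 - gamma) * label + gamma * sigmoid(margin)


# A sample where the model disagrees with the recorded preference.
margin = -1.0  # model scores the "rejected" response higher (held fixed here)
label = 1.0    # dataset says the "chosen" response is preferred

grads = []
for _ in range(5):
    grads.append(abs(soft_label_grad(margin, label)))
    label = update_label(label, margin)

for step, g in enumerate(grads):
    print(f"step {step}: |gradient| = {g:.4f}")
```

In this toy setting the margin is held fixed, so the gradient magnitude is halved at every label update. In actual training the margin changes as well, and which samples count as untrustworthy is determined by the paper's analysis rather than by this sketch.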

Paper Structure

This paper contains 58 sections, 12 theorems, 121 equations, 3 figures, 9 tables, 1 algorithm.

Key Result

Proposition 1

Consider multi-armed bandits with bounded rewards $r^\star(a) \in [0,1]$ and the softmax policy class $\Pi_\theta$. Define the best-in-class policy $\pi_{\theta^\star} = \arg\max_{\pi \in \Pi_\theta} J(\pi)$. There exist three-armed bandit instances with $\Pi_\theta$ parameterization, high coverage of the optimal arms $\mu(a \in \arg \max_a r^\star(a)) > 1/2$, and bounded KL-divergence $D_{\text{KL}}(\ldots)$ ...

Figures (3)

  • Figure 1: (a) Example of Type I Reward Hacking. The initial model has a uniform distribution over choices (e.g., responses), while the dataset has high coverage of the high-reward choice and low coverage of a low-reward choice. With a decent chance, the poorly covered, low-reward choice is labeled as preferred, causing PO methods to erroneously assign a high weight to it (Proposition 1). (b) Example of Type II Reward Hacking. The initial model is aligned with the true rewards, while the dataset has low coverage of the high-reward choice. With a decent chance, the poorly covered, high-reward choice is labeled as rejected, leading to deterioration of the model after alignment (Proposition 2).
  • Figure 2: Performance of POWER-DL compared to DPO and SimPO. POWER-DL outperforms DPO and SimPO on the alignment benchmarks AlpacaEval 2.0 and Arena-Hard across pipelines with different dataset sizes and levels of distribution shift between the data and the initial model. On the downstream mathematical reasoning task GSM8K, POWER-DL consistently maintains or improves performance, whereas the performance of models trained with DPO and SimPO can drop significantly in some cases.
  • Figure 3: POWER-DL hyperparameter robustness results in the Helpsteer2 base setting.
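The "decent chance" in Figure 1 can be illustrated with a small calculation. The sketch below assumes preference labels are drawn from a Bradley-Terry model, $P(a \succ b) = \sigma(r(a) - r(b))$, and uses made-up rewards of 0.2 and 1.0; the function names and the numbers are illustrative assumptions and are not taken from the paper. It computes the probability that a low-reward choice wins the majority of its $n$ comparisons against a high-reward choice.

```python
import math


def bt_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry probability that a is labeled as preferred over b."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))


def prob_majority_wins(p: float, n: int) -> float:
    """P(strictly more than n/2 wins in n independent comparisons won w.p. p)."""
    return sum(
        math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Illustrative rewards in [0, 1]; these values are assumptions.
p_flip = bt_prob(0.2, 1.0)  # chance the low-reward choice wins one comparison

for n in (1, 3, 11, 101):
    print(f"n = {n:3d}: P(low-reward choice looks preferred) = "
          f"{prob_majority_wins(p_flip, n):.6f}")
```

With a single comparison, the low-reward choice looks preferred with probability `p_flip`; with many comparisons, that probability becomes very small. This is why poor coverage, rather than label noise alone, is what makes the dataset misleading in this example.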

Theorems & Definitions (24)

  • Proposition 1: Type I Reward Hacking in $\star$PO
  • Remark 1: Comparison with previous theoretical results on failure of DPO
  • Proposition 2: Type II Reward Hacking in $\star$PO
  • Definition 1: Weighted Entropy (Guiasu, 1971)
  • Proposition 3: POWER Objective
  • Remark 2
  • Theorem 1: Finite-Sample Performance Guarantees of POWER
  • Remark 3
  • Proposition 4
  • Theorem 2: Learning Dynamics with Label Updates
  • ...and 14 more
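
Definition 1 in the list above refers to Guiasu's weighted entropy, which for a distribution $p$ and non-negative weights $w$ is $H_w(p) = -\sum_k w_k \, p_k \log p_k$. The sketch below computes this quantity for a discrete distribution. The function name is hypothetical, and how POWER chooses the weights is not stated in this summary, so the weights here are arbitrary examples.

```python
import math
from typing import Sequence


def weighted_entropy(probs: Sequence[float], weights: Sequence[float]) -> float:
    """Guiasu's weighted entropy: -sum_k w_k * p_k * log(p_k)."""
    if len(probs) != len(weights):
        raise ValueError("probs and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    # Terms with p_k = 0 contribute 0 by the usual convention 0 * log 0 = 0.
    return -sum(w * p * math.log(p) for p, w in zip(probs, weights) if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]

# With unit weights, the weighted entropy reduces to Shannon entropy.
shannon = weighted_entropy(uniform, [1.0, 1.0, 1.0, 1.0])

# Arbitrary example weights: outcomes with larger weights count for more.
reweighted = weighted_entropy(uniform, [2.0, 1.0, 0.5, 0.0])

print(f"unit weights:    {shannon:.4f}")
print(f"example weights: {reweighted:.4f}")
```

Setting all weights to one recovers the Shannon entropy, which makes it easy to check the function against a known value.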