Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking
Paria Rashidinejad, Yuandong Tian
TL;DR
This work tackles reward hacking in offline preference optimization by formalizing a two-source distribution-shift setting and identifying two types of hacking: Type I, where poorly covered, subpar choices appear favorable, and Type II, where decent choices appear unfavorable. It proposes POWER, a preference optimization method built on weighted-entropy robust rewards, which guards against Type I hacking by emphasizing well-supported actions, and POWER-DL, which adds dynamic labels to curb Type II hacking by shrinking gradients for untrustworthy samples. The approach comes with finite-sample guarantees and strong empirical results: POWER-DL outperforms state-of-the-art PO methods on alignment benchmarks, with gains over DPO of up to 13.0 points on AlpacaEval 2.0 and 11.5 points on Arena-Hard, while preserving performance on downstream tasks such as GSM8K. Together, the theory and experiments suggest POWER-DL is a principled route to more reliable LLM alignment, with potential extensions to broader robustness challenges and online settings.
Abstract
Aligning AI systems with human preferences typically suffers from the infamous reward hacking problem, where optimization of an imperfect reward model leads to undesired behaviors. In this paper, we investigate reward hacking in offline preference optimization, which aims to improve an initial model using a preference dataset. We identify two types of reward hacking stemming from statistical fluctuations in the dataset: Type I Reward Hacking due to subpar choices appearing more favorable, and Type II Reward Hacking due to decent choices appearing less favorable. We prove that many (mainstream or theoretical) preference optimization methods suffer from both types of reward hacking. To mitigate Type I Reward Hacking, we propose POWER, a new preference optimization method that combines Guiasu's weighted entropy with a robust reward maximization objective. POWER enjoys finite-sample guarantees under general function approximation, competing with the best covered policy in the data. To mitigate Type II Reward Hacking, we analyze the learning dynamics of preference optimization and develop a novel technique that dynamically updates preference labels toward certain "stationary labels", resulting in diminishing gradients for untrustworthy samples. Empirically, POWER with dynamic labels (POWER-DL) consistently outperforms state-of-the-art methods on alignment benchmarks, achieving improvements of up to 13.0 points on AlpacaEval 2.0 and 11.5 points on Arena-Hard over DPO, while also improving or maintaining performance on downstream tasks such as mathematical reasoning. Strong theoretical guarantees and empirical results demonstrate the promise of POWER-DL in mitigating reward hacking.
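The two ingredients named in the abstract can be illustrated with a small Python sketch. This is not the paper's implementation: the function names, the example weights, and the choice of a generic soft-label logistic loss are all assumptions made for illustration. The first function computes Guiasu's weighted entropy, -sum_i w_i p_i log p_i, which reduces to Shannon entropy when every weight is 1. The second shows that, for a soft-label logistic preference loss, the gradient with respect to the reward margin is sigmoid(margin) - label, so it vanishes once the label equals sigmoid(margin). That fixed point is used here only as a stand-in for a "stationary label"; the paper's actual update rule and stationary labels may differ.

```python
import math


def weighted_entropy(probs, weights):
    """Guiasu's weighted entropy: -sum_i w_i * p_i * log(p_i).

    Terms with p_i == 0 contribute nothing (the limit of p*log p is 0).
    """
    return -sum(w * p * math.log(p) for p, w in zip(probs, weights) if p > 0)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_label_grad(margin, label):
    """Gradient w.r.t. `margin` of the soft-label logistic loss
    -[label * log(sigmoid(margin)) + (1 - label) * log(sigmoid(-margin))],
    which simplifies to sigmoid(margin) - label.
    """
    return sigmoid(margin) - label


# Hypothetical distribution over three responses.
probs = [0.5, 0.25, 0.25]

# Unit weights recover Shannon entropy (in nats).
shannon = weighted_entropy(probs, [1.0, 1.0, 1.0])

# Assumed weights that de-emphasize the last two responses.
down_weighted = weighted_entropy(probs, [1.0, 0.1, 0.1])

# Hypothetical reward margin (chosen minus rejected) for one sample.
margin = -0.4

# A hard label of 1 yields a nonzero gradient pushing the margin up.
grad_hard = soft_label_grad(margin, 1.0)

# A label equal to sigmoid(margin) yields a zero gradient for this loss.
grad_fixed_point = soft_label_grad(margin, sigmoid(margin))

print(shannon, down_weighted, grad_hard, grad_fixed_point)
```

In this sketch, moving a label from the hard value 1 toward sigmoid(margin) shrinks the gradient magnitude for that sample, which is the general mechanism the abstract describes as "diminishing gradients for untrustworthy samples".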
