Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking
Cassidy Laidlaw, Shivam Singhal, Anca Dragan
TL;DR
This work defines reward hacking in terms of the correlation between a proxy reward $\tilde{R}$ and the true reward $R$ over the state-action pairs visited by a reference policy $\pi_{ref}$: it introduces the notion of an $r$-correlated proxy and a hacking threshold based on the true return $J(\pi,R)$. It then proves a quantitative bound showing that regularizing toward the reference policy with a $\chi^2$ divergence between occupancy measures (OMs) can provably prevent hacking, which leads to the regularized objective $\max_\pi J(\pi,\tilde{R})-\lambda\sqrt{\chi^2(\mu_\pi\|\mu_{\pi_{ref}})}$ with $\lambda=\sigma_{\tilde{R}}\sqrt{1-r^2}$. The paper introduces ORPO, which implements $\chi^2$ OM regularization using a discriminator $\hat{d}_\phi$ to estimate $\widehat{\chi^2}$ and augments the proxy reward to $R'(s,a)=\tilde{R}(s,a)-\frac{\lambda}{\sqrt{\widehat{\chi^2}}} e^{\hat{d}_\phi(s,a)}$. Empirically, $\chi^2$ OM regularization outperforms action-distribution (AD) KL regularization in four non-RLHF environments and yields stronger, more stable gains in RLHF. These results suggest replacing the standard KL penalty in RLHF with a principled $\chi^2$ OM regularization to better safeguard against reward hacking in complex, real-world objectives.
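To make the quantities in this summary concrete, the following is a minimal tabular sketch, not the ORPO implementation (which estimates the divergence with a learned discriminator). It assumes a tiny problem with three state-action pairs, treats each policy as a normalized occupancy measure so that $J(\pi,R)=\sum_{s,a}\mu_\pi(s,a)R(s,a)$, and uses the value of $\lambda$ exactly as stated above. All function names and reward values are hypothetical and chosen only for illustration.

```python
import math

def expected_return(mu, reward):
    # J(pi, R) as the expectation of the reward under a normalized occupancy measure.
    return sum(m * r for m, r in zip(mu, reward))

def correlation_and_proxy_std(mu_ref, proxy, true):
    # Pearson correlation r between proxy and true reward under mu_ref,
    # together with the proxy reward's standard deviation under mu_ref.
    mean_p = expected_return(mu_ref, proxy)
    mean_t = expected_return(mu_ref, true)
    cov = sum(m * (p - mean_p) * (t - mean_t) for m, p, t in zip(mu_ref, proxy, true))
    var_p = sum(m * (p - mean_p) ** 2 for m, p in zip(mu_ref, proxy))
    var_t = sum(m * (t - mean_t) ** 2 for m, t in zip(mu_ref, true))
    return cov / math.sqrt(var_p * var_t), math.sqrt(var_p)

def chi2_divergence(mu, mu_ref):
    # chi^2(mu || mu_ref) = sum_x (mu(x) - mu_ref(x))^2 / mu_ref(x); needs mu_ref(x) > 0.
    return sum((m - q) ** 2 / q for m, q in zip(mu, mu_ref))

def regularized_objective(mu, mu_ref, proxy, lam):
    # J(pi, proxy) - lambda * sqrt(chi^2(mu_pi || mu_ref)), as in the summary above.
    return expected_return(mu, proxy) - lam * math.sqrt(chi2_divergence(mu, mu_ref))

# Hypothetical toy problem: the third state-action pair is rare under the
# reference policy, has high proxy reward, and has very low true reward.
mu_ref = [0.60, 0.39, 0.01]
proxy_reward = [0.0, 1.0, 2.0]
true_reward = [0.0, 1.0, -3.0]

r, sigma_proxy = correlation_and_proxy_std(mu_ref, proxy_reward, true_reward)
lam = sigma_proxy * math.sqrt(1.0 - r ** 2)

mu_hack = [0.0, 0.0, 1.0]  # exploits the rare, mis-rewarded pair
mu_good = [0.0, 1.0, 0.0]  # concentrates on the pair both rewards agree on

for name, mu in [("reference", mu_ref), ("good", mu_good), ("hack", mu_hack)]:
    print(
        name,
        "proxy J:", round(expected_return(mu, proxy_reward), 3),
        "true J:", round(expected_return(mu, true_reward), 3),
        "regularized:", round(regularized_objective(mu, mu_ref, proxy_reward, lam), 3),
    )
```

In this toy setup the proxy is positively correlated with the true reward under the reference policy, yet the unregularized proxy return prefers the hacking occupancy measure. The $\chi^2$ penalty grows quickly once probability mass moves onto pairs the reference policy rarely visits, so the regularized objective ranks the hacking policy below both alternatives.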
Abstract
Because it is difficult to precisely specify complex objectives, reinforcement learning policies are often optimized using proxy reward functions that only approximate the true goal. However, optimizing proxy rewards frequently leads to reward hacking: the optimized reward function ceases to be a good proxy and the resulting policy performs poorly with respect to the unspecified true reward. Principled solutions to reward hacking have been impeded by the lack of a good definition for the problem. To address this gap, we introduce a definition of reward hacking based on the correlation between proxy and true rewards for states and actions seen by a "reference policy" that breaks down under optimization. We show that this definition captures reward hacking behavior across several realistic settings, including in reinforcement learning from human feedback (RLHF). Using our formulation, we show theoretically that regularization to the reference policy can effectively prevent reward hacking. While the current practice in RLHF applies a KL penalty between action distributions for this purpose, our theory suggests regularizing the $\chi^2$ divergence between the policies' occupancy measures can be more effective. We intuitively show the benefits of this type of regularization and demonstrate that it better mitigates reward hacking in practice across four realistic settings, including RLHF. Our code is available at https://github.com/cassidylaidlaw/orpo.
