
Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking

Cassidy Laidlaw, Shivam Singhal, Anca Dragan

TL;DR

This work defines reward hacking via the correlation between a proxy reward $\tilde{R}$ and the true reward $R$ under a reference policy, introducing the notion of an $r$-correlated proxy and a hacking threshold based on $J(\pi,R)$. It then proves a quantitative bound showing that regularizing toward the reference policy with $\chi^2$ occupancy-measure (OM) regularization can provably prevent hacking, leading to the regularized objective $\max_\pi J(\pi,\tilde{R})-\lambda\sqrt{\chi^2(\mu_\pi\|\mu_{\pi_{ref}})}$ with $\lambda=\sigma_{\tilde{R}}\sqrt{1-r^2}$. The paper introduces ORPO, which implements $\chi^2$ OM regularization via a discriminator that estimates $\widehat{\chi^2}$ and augments the proxy reward to $R'(s,a)=\tilde{R}(s,a)-\frac{\lambda}{\sqrt{\widehat{\chi^2}}} e^{\hat{d}_\phi(s,a)}$. Empirically, $\chi^2$ OM regularization outperforms action-distribution (AD) KL regularization in four non-RLHF environments and yields stronger, more stable gains in RLHF. These results suggest replacing the standard KL penalty in RLHF with principled $\chi^2$ OM regularization to better safeguard against reward hacking in complex, real-world objectives.
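
The regularized objective above can be made concrete on a toy tabular problem. The following is a minimal sketch, not the paper's ORPO implementation: it assumes occupancy measures are given exactly as normalized probability vectors over state-action pairs, that $J(\pi, R)$ is the expectation of $R$ under $\mu_\pi$, and that $r$ and $\sigma_{\tilde{R}}$ are the Pearson correlation and proxy standard deviation under $\mu_{\pi_{ref}}$. The helper names (`chi2_divergence`, `proxy_stats`, `regularized_objective`) and the toy rewards are hypothetical.

```python
import numpy as np

def chi2_divergence(mu_pi, mu_ref):
    """chi^2(mu_pi || mu_ref) = E_{mu_pi}[mu_pi / mu_ref] - 1 for tabular measures."""
    mu_pi = np.asarray(mu_pi, dtype=float)
    mu_ref = np.asarray(mu_ref, dtype=float)
    if np.any((mu_ref == 0) & (mu_pi > 0)):
        return float("inf")  # mu_pi is not absolutely continuous w.r.t. mu_ref
    support = mu_ref > 0
    return float(np.sum(mu_pi[support] ** 2 / mu_ref[support]) - 1.0)

def proxy_stats(proxy, true, mu_ref):
    """Correlation r between proxy and true reward, and proxy std, under mu_ref."""
    mean_p = np.sum(mu_ref * proxy)
    mean_t = np.sum(mu_ref * true)
    sigma_p = np.sqrt(np.sum(mu_ref * (proxy - mean_p) ** 2))
    sigma_t = np.sqrt(np.sum(mu_ref * (true - mean_t) ** 2))
    cov = np.sum(mu_ref * (proxy - mean_p) * (true - mean_t))
    return float(cov / (sigma_p * sigma_t)), float(sigma_p)

def regularized_objective(mu_pi, mu_ref, proxy, lam):
    """J(pi, proxy) - lam * sqrt(chi^2(mu_pi || mu_ref))."""
    return float(np.sum(mu_pi * proxy) - lam * np.sqrt(chi2_divergence(mu_pi, mu_ref)))

# Toy problem (assumed values): the last state-action pair has a high proxy
# reward but zero true reward, so concentrating on it is reward hacking.
mu_ref = np.full(4, 0.25)
proxy = np.array([0.0, 1.0, 2.0, 3.0])
true = np.array([0.0, 1.0, 2.0, 0.0])

r, sigma_proxy = proxy_stats(proxy, true, mu_ref)
lam = sigma_proxy * np.sqrt(1.0 - r ** 2)  # coefficient from the TL;DR

mu_hack = np.array([0.0, 0.0, 0.0, 1.0])  # all mass on the exploitable pair
obj_ref = regularized_objective(mu_ref, mu_ref, proxy, lam)
obj_hack = regularized_objective(mu_hack, mu_ref, proxy, lam)
print(f"r={r:.3f}, lambda={lam:.3f}, ref={obj_ref:.3f}, hack={obj_hack:.3f}")
```

In this toy case the hacking policy has the higher unregularized proxy return, but the $\chi^2$ penalty pushes its regularized objective below that of the reference policy.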

Abstract

Because it is difficult to precisely specify complex objectives, reinforcement learning policies are often optimized using proxy reward functions that only approximate the true goal. However, optimizing proxy rewards frequently leads to reward hacking: the optimized reward function ceases to be a good proxy and the resulting policy performs poorly with respect to the unspecified true reward. Principled solutions to reward hacking have been impeded by the lack of a good definition for the problem. To address this gap, we introduce a definition of reward hacking based on the correlation between proxy and true rewards for states and actions seen by a "reference policy" that breaks down under optimization. We show that this definition captures reward hacking behavior across several realistic settings, including in reinforcement learning from human feedback (RLHF). Using our formulation, we show theoretically that regularization to the reference policy can effectively prevent reward hacking. While the current practice in RLHF applies a KL penalty between action distributions for this purpose, our theory suggests regularizing the $\chi^2$ divergence between the policies' occupancy measures can be more effective. We intuitively show the benefits of this type of regularization and demonstrate that it better mitigates reward hacking in practice across four realistic settings, including RLHF. Our code is available at https://github.com/cassidylaidlaw/orpo.

Paper Structure

This paper contains 37 sections, 10 theorems, 96 equations, 6 figures, 10 tables, 1 algorithm.

Key Result

Theorem 5.1

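Suppose that $\tilde{R}$ is an $r$-correlated-proxy for the true reward function $R$, and let $\sigma_{\tilde{R}}$ and $\sigma_{R}$ be defined as in Definition 4.1. Then for any policy $\pi$ such that $\mu_{\pi} \ll \mu_{{\pi_\text{ref}}}$ (i.e., $\mu_{{\pi_\text{ref}}}(s, a) = 0 \Rightarrow \mu_{\pi}(s, a) = 0$), we have $\frac{J(\pi, R) - J({\pi_\text{ref}}, R)}{\sigma_{R}} \geq \frac{1}{r} \left( \frac{J(\pi, \tilde{R}) - J({\pi_\text{ref}}, \tilde{R})}{\sigma_{\tilde{R}}} - \sqrt{(1 - r^2) \, \chi^2 \left( \mu_\pi \| \mu_{\pi_\text{ref}} \right)} \right)$, where $\chi^2 \left( \mu_\pi \| \mu_{\pi_\text{ref}} \right) = \mathbb{E}_{\mu_\pi} \left[ \frac{\mu_\pi(s, a)}{\mu_{\pi_\text{ref}}(s, a)} - 1 \right]$ is the $\chi^2$ divergence between $\mu_\pi$ and $\mu_{\pi_\text{ref}}$.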
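
A bound of this kind can be sanity-checked numerically. The sketch below assumes the inequality takes the form $\frac{J(\pi, R) - J(\pi_\text{ref}, R)}{\sigma_R} \geq \frac{1}{r}\left(\frac{J(\pi, \tilde{R}) - J(\pi_\text{ref}, \tilde{R})}{\sigma_{\tilde{R}}} - \sqrt{(1 - r^2)\,\chi^2(\mu_\pi \| \mu_{\pi_\text{ref}})}\right)$, which is consistent with the coefficient $\lambda=\sigma_{\tilde{R}}\sqrt{1-r^2}$ in the TL;DR. It further assumes tabular, normalized occupancy measures, $J(\pi, R) = \mathbb{E}_{\mu_\pi}[R]$, and $r > 0$. The toy rewards and the helper names `expect` and `bound_gap` are hypothetical.

```python
import numpy as np

def expect(mu, f):
    """Expectation of f under the tabular occupancy measure mu."""
    return float(np.sum(mu * f))

def bound_gap(mu_pi, mu_ref, proxy, true):
    """Return (lhs - rhs, r) for the assumed bound; the gap should be >= 0 when r > 0."""
    dev_p = proxy - expect(mu_ref, proxy)
    dev_t = true - expect(mu_ref, true)
    sigma_p = np.sqrt(expect(mu_ref, dev_p ** 2))
    sigma_t = np.sqrt(expect(mu_ref, dev_t ** 2))
    r = min(expect(mu_ref, dev_p * dev_t) / (sigma_p * sigma_t), 1.0)
    chi2 = max(expect(mu_pi, mu_pi / mu_ref) - 1.0, 0.0)
    lhs = (expect(mu_pi, true) - expect(mu_ref, true)) / sigma_t
    proxy_gain = (expect(mu_pi, proxy) - expect(mu_ref, proxy)) / sigma_p
    rhs = (proxy_gain - np.sqrt((1.0 - r ** 2) * chi2)) / r
    return lhs - rhs, r

rng = np.random.default_rng(0)
n = 12  # number of (state, action) pairs in the toy problem
true = np.arange(n, dtype=float)
# The proxy is strictly increasing in the true reward, so the two are
# positively correlated under any full-support reference measure.
proxy = true + 0.4 * np.sin(true)

gaps, corrs = [], []
for _ in range(1000):
    # Mix with the uniform measure so mu_ref stays bounded away from zero.
    mu_ref = 0.5 / n + 0.5 * rng.dirichlet(np.ones(n))
    mu_pi = rng.dirichlet(np.ones(n))
    gap, r = bound_gap(mu_pi, mu_ref, proxy, true)
    gaps.append(gap)
    corrs.append(r)

min_gap = min(gaps)
min_corr = min(corrs)
print(f"smallest gap over random policies: {min_gap:.3e}")
```

Under these assumptions the gap is nonnegative by a Cauchy-Schwarz argument: the standardized proxy splits into $r$ times the standardized true reward plus a residual with variance $1 - r^2$ that is uncorrelated with the true reward under $\mu_{\pi_\text{ref}}$.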

Figures (6)

  • Figure 1: We present a new characterization of reward hacking and a method for preventing it. We define a proxy reward function as one that correlates with an unknown true reward function for state-action pairs sampled from some reference policy. However, optimizing the proxy alone can lead to a breakdown in the correlation and worse true reward than the reference policy. We show theoretically and empirically that optimizing the proxy with $\chi^2$ occupancy measure regularization to the reference policy can allow outperforming the reference policy under the unknown true reward.
  • Figure 2: Our definition of reward hacking successfully describes reward hacking behavior in four realistic environments and an illustrative gridworld. The top row shows the distribution of proxy and true reward values for state-action pairs sampled from a domain-appropriate reference policy for each environment; in all environments, the proxy and true rewards are correlated. However, as shown in the middle row, this correlation breaks down if the proxy is optimized via RL, and the true reward achieved is lower than that of the reference policy, which we characterize as reward hacking. In line with our theoretical results, the bottom row shows that RL with occupancy measure regularization to the reference policy can prevent reward hacking while enabling an increase in true reward.
  • Figure 3: Unlike RLHF, which attempts to prevent reward hacking by regularizing action distribution (AD) divergence from the reference policy, our results suggest regularizing using occupancy measure (OM) divergence is more effective. These plots of the glucose monitoring environment show the typical ADs and OMs of two policies. $\pi$ is close to ${\pi_\text{ref}}$ in AD; it gives slightly less insulin. However, $\pi$'s optimization of the proxy also leads to a vastly different OM with typical glucose levels far outside the healthy range (dotted lines). Thus, regularizing ADs to be close to ${\pi_\text{ref}}$ is not enough to prevent reward hacking. Instead, divergence between the OMs better captures the reward hacking behavior.
  • Figure 4: Our theory also suggests reward hacking can more effectively be prevented by regularizing with $\chi^2$ divergence instead of KL divergence. This plot illustrates how $\chi^2$ regularization is more effective at preventing reward hacking in RLHF. Both divergences can be written as the expectation of a function $g(\log(\mu_{\pi}(s, a) / \mu_{{\pi_\text{ref}}}(s, a)))$ which increases the penalty on state-action pairs based on how far the log-ratio is from zero. The $g(\cdot)$ associated with KL divergence only increases slowly for large log-ratios, so policies trained with KL divergence may produce nonsensical text. In contrast, the $g(\cdot)$ for $\chi^2$ divergence increases exponentially, better constraining the LLM to produce text similar to the SFT policy.
  • Figure 5: The true reward achieved by policies regularized with varying amounts of action distribution or occupancy measure regularization using $\chi^2$ and KL divergence. The x-axis is the regularization coefficient $\lambda$ normalized by the standard deviation of proxy rewards under the reference policy. Dots indicate the median reward and the shaded area is the range over random seeds. For RLHF, AD and OM regularization are equivalent, which is why OM regularization results are not shown for that column.
  • ...and 1 more figure
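
The comparison in the Figure 4 caption can be illustrated directly. This is a minimal sketch that assumes the standard definitions $\mathrm{KL}(\mu_\pi \| \mu_{\pi_\text{ref}}) = \mathbb{E}_{\mu_\pi}[\log(\mu_\pi / \mu_{\pi_\text{ref}})]$ and $\chi^2(\mu_\pi \| \mu_{\pi_\text{ref}}) = \mathbb{E}_{\mu_\pi}[\mu_\pi / \mu_{\pi_\text{ref}} - 1]$, so that $g(x) = x$ for KL and $g(x) = e^x - 1$ for $\chi^2$. The exact $g$ plotted in the paper may be parameterized differently, and the toy distributions and function names are hypothetical.

```python
import numpy as np

def g_kl(x):
    """Penalty as a function of the log-ratio for KL divergence (linear)."""
    return x

def g_chi2(x):
    """Penalty as a function of the log-ratio for chi^2 divergence (exponential)."""
    return np.expm1(x)  # e^x - 1

def divergence(mu_pi, mu_ref, g):
    """E_{mu_pi}[g(log(mu_pi / mu_ref))], assuming both measures have full support."""
    log_ratio = np.log(mu_pi) - np.log(mu_ref)
    return float(np.sum(mu_pi * g(log_ratio)))

# Toy occupancy measures (assumed values): mu_pi concentrates on one pair
# that the reference policy rarely visits.
mu_ref = np.array([0.49, 0.49, 0.02])
mu_pi = np.array([0.10, 0.10, 0.80])

kl = divergence(mu_pi, mu_ref, g_kl)
chi2 = divergence(mu_pi, mu_ref, g_chi2)
print(f"KL = {kl:.3f}, chi^2 = {chi2:.3f}")

# Growth of the per-sample penalty as the log-ratio increases.
for x in (1.0, 5.0, 10.0):
    print(f"log-ratio {x:>4}: g_KL = {g_kl(x):.1f}, g_chi2 = {g_chi2(x):.1f}")
```

Because the $\chi^2$ penalty grows exponentially in the log-ratio, a single state-action pair that the reference policy rarely visits dominates the $\chi^2$ divergence far more than it does the KL divergence.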

Theorems & Definitions (20)

  • Definition 4.1: Correlated proxy reward
  • Definition 4.2: Reward hacking
  • Theorem 5.1
  • Theorem A.1
  • proof
  • Corollary A.1
  • Lemma A.2
  • proof
  • Lemma A.3
  • proof
  • ...and 10 more