Mitigating Reward Hacking in RLHF via Advantage Sign Robustness

Shinnosuke Ono, Johannes Ackermann, Soichiro Nishimori, Takashi Ishida, Masashi Sugiyama

Abstract

Reward models (RMs) used in reinforcement learning from human feedback (RLHF) are vulnerable to reward hacking: as the policy maximizes a learned proxy reward, true quality plateaus or degrades. We hypothesize that reward hacking is often caused by flipped advantage signs: instead of decreasing the likelihood of a bad response, a flipped sign causes the update to increase it. By considering an adversarial perturbation in the RM parameter space, we derive a certified sign-preservation radius, the smallest perturbation that can flip the advantage sign during policy optimization. Based on this formulation, we propose Sign-Certified Policy Optimization (SignCert-PO), which down-weights non-robust completions in the policy gradient update. Unlike prior approaches that require multiple RMs or access to the RM training data, SignCert-PO is lightweight and operates purely at the policy optimization stage, using only the RM parameters and on-policy completions. On the TL;DR summarization and AlpacaFarm benchmarks, SignCert-PO consistently achieves higher win rates than baselines and reduces reward hacking.
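
The mechanism described above can be made concrete with a small sketch. The snippet below is illustrative only: it assumes a linear reward head $r_w(x,y)=w^\top\phi(x,y)$, a group-mean baseline for the advantage, an $\ell_2$ perturbation of the head weights, and a hard threshold $\epsilon$ for the down-weighting; all names (certified_radius, signcert_weights, phi, w) are hypothetical and not the paper's implementation.

```python
import numpy as np

def certified_radius(phi, w, tiny=1e-8):
    """Smallest L2 perturbation of the linear head weights w that can flip
    the sign of each completion's group-baselined advantage (a sketch under
    the linear-head assumption; not the paper's exact formulation).

    phi: (n, d) features of n on-policy completions for one prompt.
    w:   (d,)   reward-model head weights.
    """
    rewards = phi @ w                        # proxy rewards r_j = w^T phi_j
    adv = rewards - rewards.mean()           # advantages A_j with a mean baseline
    grad = phi - phi.mean(axis=0)            # dA_j/dw = phi_j - phi_bar
    grad_norm = np.linalg.norm(grad, axis=1) + tiny
    return np.abs(adv) / grad_norm           # Delta_j = |A_j| / ||phi_j - phi_bar||

def signcert_weights(delta, epsilon):
    """Hard down-weighting: keep only completions whose radius exceeds epsilon."""
    return (delta >= epsilon).astype(np.float64)

# Toy usage: re-weight advantages before a policy-gradient update.
rng = np.random.default_rng(0)
phi = rng.normal(size=(8, 16))               # 8 completions, 16-dim features
w = rng.normal(size=16)
adv = phi @ w - (phi @ w).mean()
delta = certified_radius(phi, w)
weighted_adv = signcert_weights(delta, epsilon=0.1) * adv   # feeds the PG loss
```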

Paper Structure

This paper contains 20 sections, 3 theorems, 15 equations, 4 figures, and 2 tables.

Key Result

Theorem 3.1

Under the linear head model and the uncertainty set $\mathcal{U}_\epsilon^w$ (Eq. \ref{eq:uncertainty-head}), the certified sign-preservation radius admits the closed form given in Eq. \ref{eq:certified-radius-param}.
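
As a hedged sketch of the form that Eq. \ref{eq:certified-radius-param} likely takes under the stated linear-head assumption (the paper's exact statement may differ in its choice of norm, baseline, and uncertainty set), write $r_w(x, y_j) = w^\top \phi(x, y_j)$, so that the group-baselined advantage $A_j(w) = w^\top(\phi_j - \bar\phi)$ is linear in $w$; the smallest $\ell_2$ perturbation that can flip its sign is then

$$\Delta_j \;=\; \min\bigl\{\,\lVert\delta\rVert_2 : \operatorname{sign}\bigl(A_j(w+\delta)\bigr) \neq \operatorname{sign}\bigl(A_j(w)\bigr)\bigr\} \;=\; \frac{\lvert A_j(w)\rvert}{\lVert \phi_j - \bar\phi\rVert_2}.$$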

Figures (4)

  • Figure 1: We argue that the reliability of the proxy RM's estimates differs by completion; the certified sign-preservation radius $\Delta_j$ provides this reliability measure. (a) Proxy and true advantages. Completions 7 and 8 have opposite signs, showing the proxy RM is unreliable there. (b) $\Delta_j$ is the smallest perturbation of the RM parameters that flips a completion's advantage sign. Dashed lines are decision boundaries for the $j$-th and $k$-th completions. (c) Per-completion $\Delta_j$. Completions 7 and 8 exhibit low $\Delta$, confirming that $\Delta_j$ identifies unreliable completions. SignCert-PO suppresses completions below $\Delta=\epsilon$. (d) True (solid) and proxy (dashed) reward during RL. Using $\Delta_j$ as a re-weighting mechanism, SignCert-PO prevents reward hacking and further improves the true reward. See Appendix \ref{app:toy-bandit} for details.
  • Figure 2: SignCert-PO keeps the policy in regions where the proxy RM remains reliable, preventing reward hacking. KL divergence trade-offs on TL;DR. Left: proxy RM accuracy vs. KL. SignCert-PO maintains higher RM accuracy at every KL budget. Right: gold reward (solid) and proxy reward (dashed) vs. KL. Baselines exhibit reward hacking, whereas SignCert-PO avoids this divergence. The reference policy is the SFT model $\pi_\mathrm{SFT}$.
  • Figure 3: $\Delta_j$ (Eq. \ref{eq:certified-radius-param}) is predictive of sign robustness beyond the linear head assumption, on the TL;DR task for Pythia 1B. Left axis: agreement with other perturbation models, where $A'_j$ is the advantage recomputed under whole-RM or input embedding perturbation. Right axis: agreement with the gold RM. See Appendix \ref{app:other-experimental-details} for details.
  • Figure 4: SignCert-PO provides the largest gains when preference data is limited, with the gap narrowing as more data becomes available. Gold win rate vs. number of preference data epochs on TL;DR for the Pythia 1B proxy RM. We also observe overfitting of the proxy RM for 2.3M pairs.

Theorems & Definitions (4)

  • Definition 3.1: Certified sign-preservation radius
  • Theorem 3.1: Certified radius
  • Theorem 3.2: Worst-case advantage under per-completion adversary
  • Lemma 3.3: Policy gradient of the global robust objective
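
Theorem 3.2 and Lemma 3.3 are listed by title only. Purely as a hedged illustration, under the same linear-head assumption and an $\ell_2$ uncertainty set of radius $\epsilon$ (not necessarily the paper's exact setting), the worst-case advantage induced by a per-completion adversary would take the form

$$\underline{A}_j \;=\; \min_{\lVert\delta\rVert_2 \le \epsilon} A_j(w+\delta) \;=\; A_j(w) - \epsilon\,\lVert \phi_j - \bar\phi \rVert_2,$$

which, for completions with $A_j(w) > 0$, remains non-negative exactly when $\Delta_j \ge \epsilon$, consistent with the suppression rule described for Figure 1(c).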