Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards

Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama

TL;DR

GR achieves a higher GPT-judged win-rate in RLHF, avoids overly focusing on the format in rule-based math rewards, and prevents hacking the judge in LLM-as-a-Judge math tasks.

Abstract

Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs). A common problem is reward hacking, where the policy exploits inaccuracies of the reward and learns an unintended behavior. Most previous works address this by limiting the policy update with a Kullback-Leibler (KL) penalty towards a reference model. We propose a different framing: train the LM in a way that biases policy updates towards regions in which the reward is more accurate. First, we derive a theoretical connection between the accuracy of a reward model and the flatness of an optimum at convergence. Gradient regularization (GR) can then be used to bias training towards flatter regions and thereby maintain reward model accuracy. We confirm these results by showing that the gradient norm and reward accuracy are empirically correlated in RLHF. We then show that Reference Resets of the KL penalty implicitly use GR to find flatter regions with higher reward accuracy. We further improve on this by proposing explicit GR with an efficient finite-difference estimate. Empirically, GR performs better than a KL penalty across a diverse set of RL experiments with LMs. GR achieves a higher GPT-judged win-rate in RLHF, avoids overly focusing on the format in rule-based math rewards, and prevents hacking the judge in LLM-as-a-Judge math tasks.
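
The abstract mentions explicit GR with a finite-difference estimate but does not spell out the update. The following is a minimal sketch of one common form of gradient regularization, assuming a penalty of $\frac{\lambda}{2}\|\nabla L(\theta)\|^2$ added to a loss $L$, whose gradient $\nabla L + \lambda H \nabla L$ is approximated with one extra gradient evaluation. The function name `gr_gradient`, the parameters `lam` and `eps`, and the toy quadratic loss are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gr_gradient(grad_fn, theta, lam=0.1, eps=1e-3):
    """Approximate the gradient of L(theta) + lam/2 * ||grad L(theta)||^2.

    The penalty's gradient is the Hessian-vector product H @ g, estimated here
    with a forward finite difference: (grad(theta + eps * g) - grad(theta)) / eps.
    Assumed form of GR for illustration; the paper's exact estimator may differ.
    """
    g = grad_fn(theta)                   # plain gradient at theta
    g_shift = grad_fn(theta + eps * g)   # gradient after a small step along g
    hvp = (g_shift - g) / eps            # finite-difference estimate of H @ g
    return g + lam * hvp

# Toy quadratic loss 0.5 * theta^T A theta with one sharp and one flat direction.
A = np.diag([10.0, 0.1])

def quad_grad(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
step = gr_gradient(quad_grad, theta)
```

In this toy case the regularized gradient is amplified far more along the sharp direction (curvature 10) than along the flat one (curvature 0.1), which is the sense in which the penalty pushes optimization away from sharp regions.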

Paper Structure

This paper contains 50 sections, 8 theorems, 52 equations, 15 figures, and 9 tables.

Key Result

Proposition 3.3

Assume a Gaussian policy with fixed covariance $\Sigma$, where we denote the policy noise $Z\sim\mathcal{N}(0,\Sigma)$, and that $\widetilde{R}(s,a)$ is $\beta$-smooth in $a$. Further, $\mathfrak{J}(\phi^*) := \nabla_{\phi}\,\mu_{\phi}(s)\bigm|_{\phi=\phi^*}$ is the Jacobian matrix of the mean action [...] with radius $D^* \le \|\mathfrak{J}(\phi^*)\| \,\mathcal{E} +\mathcal{O}(\mathcal{E}^2)$. For a given [...]

Figures (15)

  • Figure 1: We argue that reward hacking often corresponds to exploiting sharp maxima in action space, as illustrated by the conceptual figure (left). For example, an LLM judge may be confused and assign a high reward to a wrong answer with specific formatting. In the LLM-as-a-Judge training run shown on the right, the increase in gradient norm coincides with reward hacking, causing the true reward to collapse. By using gradient norm regularization, we can prevent this issue and obtain a better model, as shown by the improved true reward. The examples show Qwen2.5-0.5B models trained on GSM8K with a Qwen2.5-1.5B-Instruct judge with access to the true answer.
  • Figure 2: Conceptual illustration of our theoretical argument: (left) Regularizing the gradient norm biases optimization toward flat basins in parameter space, and (right) under action-smoothness, a flat maximum makes $\delta$-close pairs unlikely to have a reward gap larger than $K$, i.e. decreases the probability of overly sharp action pairs $a_1,a_2:\|a_1-a_2\|\le\delta,|\widetilde{R}(s,a_1)-\widetilde{R}(s,a_2)|>K$. Under the assumption of a Lipschitz-continuous true reward $R^*$, each such pair implies an incorrect proxy reward $\widetilde{R}$.
  • Figure 3: When the gradient norm decreases in a reset iteration, so do the sharpness and the BT loss $\mathcal{L}_\mathrm{RM}$. Evolution of reward, sharpness, and BT loss during training on TL;DR with Pythia 1B using GRPO with reference resets; resets are shown as grey dashed lines. After initially spiking in an iteration, the gradient norm decreases along with the sharpness of the parameters and the BT loss under the current policy. We show moving averages over 30 steps.
  • Figure 4: Reference Resets outperform all possible weights $\beta$ of the KL penalty. Oracle evaluation (Gold Model Score) vs. KL from the initial model for Pythia 1B on the TL;DR test set.
  • Figure 5: Explicit GR performs well even with inaccurate RMs. RM accuracy on SFT data vs. GPT 4.1 accuracy for different SFT and RM models, corresponding to different random seeds for the full RLHF pipeline. The x-axis scale is nonlinear.
  • ...and 10 more figures
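
The quantity described in the Figure 2 caption, the probability of overly sharp action pairs $a_1,a_2$ with $\|a_1-a_2\|\le\delta$ and $|\widetilde{R}(s,a_1)-\widetilde{R}(s,a_2)|>K$, can be estimated by sampling. The sketch below is a hypothetical Monte Carlo illustration for a fixed state, assuming an isotropic Gaussian policy and a toy bump-shaped proxy reward; `sharp_pair_rate`, `peaked_reward`, and all numeric settings are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def sharp_pair_rate(reward_fn, mu, sigma, delta, K, n=20000, seed=0):
    """Fraction of sampled delta-close action pairs whose reward gap exceeds K."""
    rng = np.random.default_rng(seed)
    # First action: sampled from an isotropic Gaussian policy N(mu, sigma^2 I).
    a1 = mu + sigma * rng.standard_normal((n, mu.shape[0]))
    # Second action: a1 plus a random offset with norm at most delta.
    direction = rng.standard_normal(a1.shape)
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    radius = delta * rng.uniform(size=(n, 1))
    a2 = a1 + radius * direction
    gap = np.abs(reward_fn(a1) - reward_fn(a2))
    return float(np.mean(gap > K))

def peaked_reward(width):
    """Toy proxy reward: a Gaussian bump at the origin; small width = sharp maximum."""
    return lambda a: np.exp(-np.sum(a**2, axis=1) / (2 * width**2))

mu = np.zeros(2)
flat_rate = sharp_pair_rate(peaked_reward(2.0), mu, sigma=0.3, delta=0.1, K=0.05)
sharp_rate = sharp_pair_rate(peaked_reward(0.1), mu, sigma=0.3, delta=0.1, K=0.05)
```

With these settings the wide bump yields a lower rate of overly sharp pairs than the narrow one, matching the intuition in the caption that a flat maximum makes large reward gaps between nearby actions unlikely.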

Theorems & Definitions (16)

  • Definition 3.1: $\mathcal{E}-\widehat{L}$ flat reward maximum
  • Definition 3.2: $(\delta,K,\rho)$-pairwise robust policy
  • Proposition 3.3: $\mathcal{E}-\widehat{L}$ flat reward implies $(\delta,K,\rho)$ robust policy
  • Proposition 3.4
  • Definition 2.1: $D-\widehat{L}$ action robust policy, slightly modified from lee_flat_2025
  • Proposition 2.5: $\mathcal{E}-\widehat{L}$ flat return implies $D-\widehat{L}$ robust policy, slightly modified from lee_flat_2025
  • proof
  • Lemma 2.6: Flatness and $\beta$-smoothness imply bounded gradient
  • proof
  • Lemma 2.7: Pointwise action-gradient is controlled by Gaussian-smoothing
  • ...and 6 more