Rethinking KL Regularization in RLHF: From Value Estimation to Gradient Optimization

Kezhao Liu, Jason Klein Liu, Mingtao Chen, Yiming Liu

TL;DR

This work reframes KL regularization in RLHF from value estimation to gradient optimization, introducing a unified framework that connects KL terms used as reward coefficients with those used as explicit losses. It proves that under on-policy conditions, the conventional '$k_1$ in reward' and '$k_2$ as loss' are gradient-equivalent and constitute the principled KL gradient, while '$k_3$ as loss' is a biased first-order approximation. The authors also show that off-policy implementations require proper importance sampling corrections, and they propose principled alternatives (e.g., bounded losses) for stability. Empirical results on math reasoning tasks corroborate the theory, demonstrating that '$k_1$ as loss' is ineffective, '$k_2$ as loss' provides stronger regularization, and '$k_3$ as loss' can induce instability, thereby guiding robust RLHF design.
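For orientation, the block below sketches the quantities behind the $k_n$ notation. The definitions of $k_1$, $k_2$, $k_3$ are an assumption: they are the standard sample-based KL estimators written with the ratio $\delta=\pi_{\text{ref}}/\pi_{\theta}$, chosen because they reproduce the gradient coefficients quoted in the Figure 1 caption. The paper's own equations may use different notation (for example, it colors the trainable $\theta$ to distinguish it from a detached snapshot, which is omitted here).

```latex
\begin{align*}
\delta &= \frac{\pi_{\text{ref}}(y|x)}{\pi_{\theta}(y|x)}, \qquad y \sim \pi_{\theta}(\cdot|x) \\
k_1 &= -\log\delta, &
\nabla_{\theta} k_1 &= 1 \cdot \nabla_{\theta}\log\pi_{\theta}(y|x) \\
k_2 &= \tfrac{1}{2}(\log\delta)^2, &
\nabla_{\theta} k_2 &= \log\frac{\pi_{\theta}(y|x)}{\pi_{\text{ref}}(y|x)}\,\nabla_{\theta}\log\pi_{\theta}(y|x) \\
k_3 &= \delta - 1 - \log\delta, &
\nabla_{\theta} k_3 &= (1-\delta)\,\nabla_{\theta}\log\pi_{\theta}(y|x) \\[4pt]
\text{`$k_1$ in reward':}\quad c &= \operatorname{sg}[k_1] = \log\frac{\pi_{\theta}(y|x)}{\pi_{\text{ref}}(y|x)}
&& \text{(same coefficient as `$k_2$ as loss')} \\
-\log\delta &= (1-\delta) + \tfrac{1}{2}(1-\delta)^2 + O\big((1-\delta)^3\big)
&& \text{(so $1-\delta$ matches $-\log\delta$ only to first order)}
\end{align*}
```

Here $\operatorname{sg}[\cdot]$ denotes a stop-gradient (detach). Each gradient has the form $c(x,y)\,\nabla_{\theta}\log\pi_{\theta}(y|x)$, which is what lets the 'in reward' and 'as loss' styles be compared through a single scalar coefficient.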

Abstract

Reinforcement Learning from Human Feedback (RLHF) leverages a Kullback-Leibler (KL) divergence loss to stabilize training and prevent overfitting. However, in methods such as GRPO, its implementation may be guided by principles from numerical value estimation, a practice that overlooks the term's functional role as an optimization loss. To analyze this issue, we establish a unified framework that connects two seemingly distinct implementation styles: using the mathematical term $k_n$ as a detached coefficient for the policy's score function ('$k_n$ in reward') or as a direct loss function through which gradients are propagated ('$k_n$ as loss'). We show that the latter can always be analyzed via an equivalent gradient coefficient in the former, unifying the two perspectives. Through this framework, we prove that the conventional '$k_1$ in reward' (as in PPO) is the principled loss for Reverse KL (RKL) regularization. We further establish a key finding: under on-policy conditions, the '$k_2$ as loss' formulation is, in fact, gradient-equivalent to '$k_1$ in reward'. This equivalence, first proven in our work, identifies both as the theoretically sound implementations of the RKL objective. In contrast, we show that the recently adopted '$k_3$ as loss' (as in GRPO) is merely a first-order, biased approximation of the principled loss. Furthermore, we argue that common off-policy implementations of '$k_n$ as loss' methods are biased due to neglected importance sampling, and we propose a principled correction. Our findings provide a comprehensive, gradient-based rationale for choosing and correctly implementing KL regularization, paving the way for more robust and effective RLHF systems.
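The off-policy bias mentioned above can be illustrated on a toy problem. The sketch below is not the paper's correction: it assumes a three-outcome softmax policy, a hypothetical stale behavior policy `mu`, and a generic importance weight $\pi_{\theta}/\mu$ treated as a constant. It compares the exact expected gradient of '$k_2$ as loss' when samples come from `mu`, with and without that weight, against the on-policy RKL gradient.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def score(pi, y):
    """Gradient of log pi[y] with respect to the softmax logits."""
    return [(1.0 if j == y else 0.0) - pi[j] for j in range(len(pi))]

def expected_grad(sampling_dist, coeff, pi):
    """Exact E_{y ~ sampling_dist}[coeff(y) * grad log pi(y)]."""
    grad = [0.0] * len(pi)
    for y, p_y in enumerate(sampling_dist):
        s = score(pi, y)
        for j in range(len(pi)):
            grad[j] += p_y * coeff(y) * s[j]
    return grad

def max_abs_diff(a, b):
    return max(abs(u - v) for u, v in zip(a, b))

# Toy setup (all values are illustrative assumptions).
pi = softmax([0.5, -0.2, 0.1])    # current policy pi_theta
ref = [0.25, 0.5, 0.25]           # reference policy pi_ref
mu = softmax([0.0, 0.3, -0.4])    # stale behavior policy that produced the samples

def k2_coeff(y):
    # Gradient coefficient of 'k2 as loss': log(pi_theta / pi_ref).
    return math.log(pi[y] / ref[y])

# On-policy target: samples drawn from pi itself.
target = expected_grad(pi, k2_coeff, pi)
# Off-policy without correction: samples drawn from mu, no importance weight.
naive = expected_grad(mu, k2_coeff, pi)
# Off-policy with a detached importance weight pi/mu on each sample.
corrected = expected_grad(mu, lambda y: (pi[y] / mu[y]) * k2_coeff(y), pi)

print("naive bias:    ", max_abs_diff(naive, target))
print("corrected bias:", max_abs_diff(corrected, target))
```

In this toy setting the unweighted off-policy gradient differs from the on-policy target, while the importance-weighted one matches it up to floating-point rounding.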

Paper Structure

This paper contains 89 sections, 9 theorems, 68 equations, 5 figures, 2 tables.

Key Result

Theorem 5.1

Let $\pi_{\theta}$ be a detached snapshot of the trainable policy $\pi_{\textcolor{red}{\theta}}$ whose parameters coincide at the time of gradient evaluation. For samples $y$ drawn on-policy from $\pi_{\theta}(\cdot|x)$, the following objectives have the same expected gradient as the target in eq:k
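As a numerical sanity check of the on-policy claim, the sketch below uses a three-outcome softmax policy (an assumption made so that expectations can be computed exactly rather than sampled). It compares the expected gradient of each coefficient against a finite-difference gradient of the reverse KL $\mathrm{KL}(\pi_{\theta}\,\|\,\pi_{\text{ref}})$. The coefficient forms are taken from the Figure 1 caption; the helper names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def score(pi, y):
    """Gradient of log pi[y] with respect to the softmax logits."""
    return [(1.0 if j == y else 0.0) - pi[j] for j in range(len(pi))]

def on_policy_grad(pi, coeff):
    """Exact E_{y ~ pi}[coeff(y) * grad log pi(y)] over a finite outcome set."""
    grad = [0.0] * len(pi)
    for y, p_y in enumerate(pi):
        s = score(pi, y)
        for j in range(len(pi)):
            grad[j] += p_y * coeff(y) * s[j]
    return grad

def reverse_kl(logits, ref):
    pi = softmax(logits)
    return sum(p * math.log(p / r) for p, r in zip(pi, ref))

def numerical_grad(f, x, h=1e-6):
    """Central finite-difference gradient of f at x."""
    grad = []
    for j in range(len(x)):
        up = list(x)
        down = list(x)
        up[j] += h
        down[j] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def max_abs_diff(a, b):
    return max(abs(u - v) for u, v in zip(a, b))

logits = [0.5, -0.2, 0.1]   # illustrative policy parameters
ref = [0.25, 0.5, 0.25]     # illustrative reference policy
pi = softmax(logits)

rkl_grad = numerical_grad(lambda t: reverse_kl(t, ref), logits)
# 'k1 in reward' and 'k2 as loss' share the coefficient log(pi / pi_ref).
principled = on_policy_grad(pi, lambda y: math.log(pi[y] / ref[y]))
# 'k3 as loss' uses the coefficient 1 - pi_ref / pi.
k3_loss = on_policy_grad(pi, lambda y: 1.0 - ref[y] / pi[y])
# 'k1 as loss' uses the constant coefficient 1.
k1_loss = on_policy_grad(pi, lambda y: 1.0)

print("principled vs RKL gradient:", max_abs_diff(principled, rkl_grad))
print("k3 as loss vs RKL gradient:", max_abs_diff(k3_loss, rkl_grad))
print("k1 as loss gradient norm:  ", max(abs(g) for g in k1_loss))
```

In this toy case the shared coefficient reproduces the reverse-KL gradient up to finite-difference error, the '$k_3$ as loss' gradient visibly deviates from it, and the '$k_1$ as loss' gradient vanishes in expectation.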

Figures (5)

  • Figure 1: Comparison of KL regularization gradient coefficients. Each curve shows the scalar coefficient $c(x,y)$ which would multiply the score function $\nabla_{\textcolor{red}{\theta}}\log \pi_{\textcolor{red}{\theta}}(y|x)$, plotted against $\log \pi_{\theta}(y|x)$ with $\pi_{\text{ref}}(y|x)=0.25$ (vertical dashed line). Principled implementations ('$k_1$ in reward' or '$k_2$ as loss') yield $c=\log\!(\pi_{\theta}/\pi_{\text{ref}})$, a linear restoring force in log-probability. The '$k_3$ as loss' uses $c=1-\pi_{\text{ref}}/\pi_{\theta}$, a first-order Taylor surrogate of $-\log \delta$ at $\delta=\pi_{\text{ref}}/\pi_{\theta}=1$: it is loose when $\log \pi_{\theta}$ is large ($\pi_{\theta}\!\gg\!\pi_{\text{ref}}$) and can blow up when $\log \pi_{\theta}$ is small ($\pi_{\theta}\!\ll\!\pi_{\text{ref}}$). The naive '$k_1$ as loss' gives $c\equiv 1$, producing a zero-mean, non-regularizing gradient in expectation.
  • Figure 2: Comparison of '$\boldsymbol{k_1}$ as loss' versus no KL regularization. The training dynamics are nearly indistinguishable, empirically confirming the theoretical prediction from \ref{sec:kl_effectiveness}: '$\boldsymbol{k_1}$ as loss' is ineffective as a KL regularizer due to its gradient's independence from the reference model and its zero-mean gradient expectation.
  • Figure 3: Comparison of the principled '$\boldsymbol{k_2}$ as loss' against its first-order surrogate '$\boldsymbol{k_3}$ as loss'. Both variants effectively constrain the policy, but '$\boldsymbol{k_2}$ as loss' demonstrates superior regularization properties, maintaining a tighter coupling to the reference policy and yielding a more stable optimization path, evidenced by lower reward variance.
  • Figure 4: Comparison of '$\boldsymbol{k_1}$ as loss' versus no KL regularization. The training dynamics are nearly indistinguishable, empirically confirming the theoretical prediction from \ref{sec:kl_effectiveness}: '$\boldsymbol{k_1}$ as loss' is ineffective as a KL regularizer due to its gradient's independence from the reference model and its zero-mean gradient expectation.
  • Figure 5: Comparison of the principled '$\boldsymbol{k_2}$ as loss' against its first-order surrogate '$\boldsymbol{k_3}$ as loss'. Both variants effectively constrain the policy, but '$\boldsymbol{k_2}$ as loss' demonstrates superior regularization properties, maintaining a tighter coupling to the reference policy and yielding a more stable optimization path, evidenced by lower reward variance.
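The coefficient curves described in the Figure 1 caption can be tabulated directly. The sketch below evaluates the three coefficient formulas from that caption at a few values of $\pi_{\theta}(y|x)$ with $\pi_{\text{ref}}(y|x)=0.25$; the function names and the chosen grid of probabilities are illustrative assumptions.

```python
import math

PI_REF = 0.25  # reference probability used in the Figure 1 caption

def coeff_principled(pi_theta, pi_ref=PI_REF):
    """'k1 in reward' / 'k2 as loss': c = log(pi_theta / pi_ref)."""
    return math.log(pi_theta / pi_ref)

def coeff_k3_as_loss(pi_theta, pi_ref=PI_REF):
    """'k3 as loss': c = 1 - pi_ref / pi_theta."""
    return 1.0 - pi_ref / pi_theta

def coeff_k1_as_loss(pi_theta, pi_ref=PI_REF):
    """'k1 as loss': c = 1, independent of the reference policy."""
    return 1.0

print(f"{'pi_theta':>9} {'principled':>11} {'k3 as loss':>11} {'k1 as loss':>11}")
for pi_theta in (0.001, 0.01, 0.1, 0.25, 0.26, 0.5, 0.99):
    print(
        f"{pi_theta:>9.3f} "
        f"{coeff_principled(pi_theta):>11.4f} "
        f"{coeff_k3_as_loss(pi_theta):>11.4f} "
        f"{coeff_k1_as_loss(pi_theta):>11.4f}"
    )
```

Near $\pi_{\theta}=\pi_{\text{ref}}$ the two regularizing coefficients nearly coincide. For $\pi_{\theta}\ll\pi_{\text{ref}}$ the '$k_3$ as loss' coefficient grows like $-\pi_{\text{ref}}/\pi_{\theta}$ while the logarithmic one grows far more slowly, and for $\pi_{\theta}\gg\pi_{\text{ref}}$ the '$k_3$ as loss' coefficient saturates below 1.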

Theorems & Definitions (15)

  • Theorem 5.1: On-policy gradient equivalence of principled RKL surrogate losses
  • proof: Sketch (full proof in \ref{app:kl_gradient_equivalence_proof})
  • Lemma C.1: Log-derivative identity with detached denominator
  • proof
  • Lemma C.2: Zero-mean score
  • proof
  • Corollary C.0.1: Score-function reweighting
  • Corollary C.0.2: Baseline invariance
  • Lemma E.1: First-order agreement and second-order bias
  • proof: Proof sketch
  • ...and 5 more