Rethinking KL Regularization in RLHF: From Value Estimation to Gradient Optimization
Kezhao Liu, Jason Klein Liu, Mingtao Chen, Yiming Liu
TL;DR
This work reframes KL regularization in RLHF from value estimation to gradient optimization, introducing a unified framework that connects KL terms used as reward coefficients with those used as explicit losses. It proves that under on-policy conditions, the conventional '$k_1$ in reward' and '$k_2$ as loss' are gradient-equivalent and constitute the principled KL gradient, while '$k_3$ as loss' is a biased first-order approximation. The authors also show that off-policy implementations require proper importance sampling corrections, and they propose principled alternatives (e.g., bounded losses) for stability. Empirical results on math reasoning tasks corroborate the theory, demonstrating that '$k_1$ as loss' is ineffective, '$k_2$ as loss' provides stronger regularization, and '$k_3$ as loss' can induce instability, thereby guiding robust RLHF design.
Abstract
Reinforcement Learning from Human Feedback (RLHF) leverages a Kullback-Leibler (KL) divergence loss to stabilize training and prevent overfitting. However, in methods such as GRPO, its implementation may be guided by principles from numerical value estimation, a practice that overlooks the term's functional role as an optimization loss. To analyze this issue, we establish a unified framework that connects two seemingly distinct implementation styles: using the mathematical term $k_n$ as a detached coefficient for the policy's score function ('$k_n$ in reward') or as a direct loss function through which gradients are propagated ('$k_n$ as loss'). We show that the latter can always be analyzed via an equivalent gradient coefficient in the former, unifying the two perspectives. Through this framework, we prove that the conventional '$k_1$ in reward' (as in PPO) is the principled loss for Reverse KL (RKL) regularization. We further establish a key finding: under on-policy conditions, the '$k_2$ as loss' formulation is, in fact, gradient-equivalent to '$k_1$ in reward'. This equivalence, first proven in our work, identifies both as the theoretically sound implementations of the RKL objective. In contrast, we show that the recently adopted '$k_3$ as loss' (as in GRPO) is merely a first-order, biased approximation of the principled loss. Furthermore, we argue that common off-policy implementations of '$k_n$ as loss' methods are biased because they neglect importance sampling, and we propose a principled correction. Our findings provide a comprehensive, gradient-based rationale for choosing and correctly implementing KL regularization, paving the way for more robust and effective RLHF systems.
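The on-policy gradient claims above can be checked numerically on a toy problem. The sketch below is illustrative and not taken from the paper. It assumes the commonly used estimator definitions $k_1 = -\log r$, $k_2 = \frac{1}{2}(\log r)^2$, and $k_3 = r - 1 - \log r$ with $r = \pi_{\mathrm{ref}}(x)/\pi_\theta(x)$, which this section does not spell out. The three-outcome softmax policy, the reference distribution, and all function names are hypothetical. Expectations are computed exactly over the outcomes, and finite differences stand in for automatic differentiation.

```python
import numpy as np

def log_softmax(theta):
    """Numerically stable log-softmax of a 1-D logit vector."""
    z = theta - theta.max()
    return z - np.log(np.exp(z).sum())

def num_grad(f, theta, eps=1e-6):
    """Central-difference gradient of a scalar function f at theta."""
    g = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        g[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return g

# Hypothetical toy setup: uniform current policy, fixed reference policy.
theta0 = np.zeros(3)
log_ref = np.log(np.array([0.5, 0.3, 0.2]))

# On-policy sampling weights pi_theta0(x), held fixed ("detached"):
# samples are drawn from the current policy and are not differentiated.
w = np.exp(log_softmax(theta0))
# k1 evaluated at theta0 and detached, used as a reward-style coefficient.
k1_detached = log_softmax(theta0) - log_ref

def reverse_kl(theta):
    """Exact KL(pi_theta || pi_ref); the sampling distribution moves with theta."""
    lp = log_softmax(theta)
    return np.sum(np.exp(lp) * (lp - log_ref))

def k1_in_reward(theta):
    """Score-function surrogate: detached k1 multiplies log pi_theta."""
    return np.sum(w * k1_detached * log_softmax(theta))

def k1_as_loss(theta):
    return np.sum(w * (log_softmax(theta) - log_ref))

def k2_as_loss(theta):
    return np.sum(w * 0.5 * (log_softmax(theta) - log_ref) ** 2)

def k3_as_loss(theta):
    log_r = log_ref - log_softmax(theta)
    return np.sum(w * (np.exp(log_r) - 1.0 - log_r))

objectives = {
    "reverse KL (exact)": reverse_kl,
    "k1 in reward": k1_in_reward,
    "k1 as loss": k1_as_loss,
    "k2 as loss": k2_as_loss,
    "k3 as loss": k3_as_loss,
}
grads = {name: num_grad(f, theta0) for name, f in objectives.items()}

for name, g in grads.items():
    print(f"{name:>20}: {np.round(g, 4)}")
```

On this toy problem, the '$k_1$ in reward' and '$k_2$ as loss' gradients agree with the exact reverse-KL gradient up to finite-difference error, the expected gradient of '$k_1$ as loss' vanishes, and '$k_3$ as loss' yields a different gradient. The check covers only the on-policy case; it does not attempt the importance sampling correction for off-policy data.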
