Rethinking KL Regularization in RLHF: From Value Estimation to Gradient Optimization

Kezhao Liu; Jason Klein Liu; Mingtao Chen; Yiming Liu

TL;DR

This work reframes KL regularization in RLHF from value estimation to gradient optimization, introducing a unified framework that connects KL terms used as reward coefficients with those used as explicit losses. It proves that under on-policy conditions, the conventional '$k_1$ in reward' and '$k_2$ as loss' are gradient-equivalent and constitute the principled KL gradient, while '$k_3$ as loss' is a biased first-order approximation. The authors also show that off-policy implementations require proper importance sampling corrections, and they propose principled alternatives (e.g., bounded losses) for stability. Empirical results on math reasoning tasks corroborate the theory, demonstrating that '$k_1$ as loss' is ineffective, '$k_2$ as loss' provides stronger regularization, and '$k_3$ as loss' can induce instability, thereby guiding robust RLHF design.

Abstract

Reinforcement Learning from Human Feedback (RLHF) leverages a Kullback-Leibler (KL) divergence loss to stabilize training and prevent overfitting. However, in methods such as GRPO, its implementation may be guided by principles from numerical value estimation, a practice that overlooks the term's functional role as an optimization loss. To analyze this issue, we establish a unified framework that connects two seemingly distinct implementation styles: using the mathematical term $k_n$ as a detached coefficient for the policy's score function ('$k_n$ in reward') or as a direct loss function through which gradients are propagated ('$k_n$ as loss'). We show that the latter can always be analyzed via an equivalent gradient coefficient in the former, unifying the two perspectives. Through this framework, we prove that the conventional '$k_1$ in reward' (as in PPO) is the principled loss for Reverse KL (RKL) regularization. We further establish a key finding: under on-policy conditions, the '$k_2$ as loss' formulation is, in fact, gradient-equivalent to '$k_1$ in reward'. This equivalence, first proven in our work, identifies both as the theoretically sound implementations of the RKL objective. In contrast, we show that the recently adopted '$k_3$ as loss' (as in GRPO) is merely a first-order, biased approximation of the principled loss. Furthermore, we argue that common off-policy implementations of '$k_n$ as loss' methods are biased because they neglect importance sampling, and we propose a principled correction. Our findings provide a comprehensive, gradient-based rationale for choosing and correctly implementing KL regularization, paving the way for more robust and effective RLHF systems.
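
The on-policy claims above can be checked numerically on a toy problem. The sketch below is illustrative only and is not the authors' code. It assumes a three-action softmax policy and the commonly used estimator definitions $k_1 = \log(\pi_\theta/\pi_{ref})$, $k_2 = \tfrac{1}{2}\log^2(\pi_\theta/\pi_{ref})$, and $k_3 = \pi_{ref}/\pi_\theta - 1 + \log(\pi_\theta/\pi_{ref})$, which this excerpt does not spell out. All function names, logits, and the reference distribution are hypothetical. Expectations are computed exactly by summing over actions, and "as loss" gradients are obtained by central differences with the on-policy sample weights held fixed.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_ratio(theta, ref, x):
    """log(pi_theta(x) / pi_ref(x)) for a softmax policy with logits theta."""
    return math.log(softmax(theta)[x] / ref[x])

# Assumed per-sample estimators of KL(pi_theta || pi_ref) at a sample x ~ pi_theta.
def k1(theta, ref, x):
    return log_ratio(theta, ref, x)

def k2(theta, ref, x):
    return 0.5 * log_ratio(theta, ref, x) ** 2

def k3(theta, ref, x):
    d = log_ratio(theta, ref, x)
    return math.exp(-d) - 1.0 + d

def partial(f, theta, j, eps=1e-5):
    """Central-difference estimate of d f / d theta_j."""
    up = list(theta)
    up[j] += eps
    down = list(theta)
    down[j] -= eps
    return (f(up) - f(down)) / (2.0 * eps)

def in_reward_grad(k, theta, ref):
    """E_x[k(x) * grad log pi(x)], with k used as a detached coefficient."""
    pi = softmax(theta)
    n = len(theta)
    # For a softmax policy, d log pi(x) / d theta_j = 1[x == j] - pi[j].
    return [sum(pi[x] * k(theta, ref, x) * (float(x == j) - pi[j])
                for x in range(n)) for j in range(n)]

def as_loss_grad(k, theta, ref):
    """E_x[grad k(x)], differentiating through k with sample weights held fixed."""
    pi = softmax(theta)
    n = len(theta)
    return [sum(pi[x] * partial(lambda t: k(t, ref, x), theta, j)
                for x in range(n)) for j in range(n)]

def reverse_kl(theta, ref):
    pi = softmax(theta)
    return sum(p * math.log(p / q) for p, q in zip(pi, ref))

def true_rkl_grad(theta, ref):
    """Finite-difference gradient of the exact reverse KL, used as ground truth."""
    return [partial(lambda t: reverse_kl(t, ref), theta, j)
            for j in range(len(theta))]

def max_abs_diff(a, b):
    return max(abs(u - v) for u, v in zip(a, b))

# Hypothetical toy values chosen only for illustration.
theta = [0.5, -0.2, 1.0]
ref = softmax([0.0, 0.3, -0.4])
target = true_rkl_grad(theta, ref)

print("k1 in reward vs RKL grad:", max_abs_diff(in_reward_grad(k1, theta, ref), target))
print("k2 as loss   vs RKL grad:", max_abs_diff(as_loss_grad(k2, theta, ref), target))
print("k3 as loss   vs RKL grad:", max_abs_diff(as_loss_grad(k3, theta, ref), target))
print("k1 as loss gradient     :", as_loss_grad(k1, theta, ref))
```

In this toy setting, '$k_1$ in reward' and '$k_2$ as loss' both match the exact reverse KL gradient up to finite-difference error, '$k_1$ as loss' has an expected gradient of zero, and '$k_3$ as loss' gives a visibly different gradient. The sketch covers only the on-policy case; it does not model the off-policy importance sampling correction discussed in the abstract.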
