Unifying Stable Optimization and Reference Regularization in RLHF

Li He, Qiang Qu, He Zhao, Stephen Wan, Dadong Wang, Lina Yao, Tongliang Liu

TL;DR

The paper tackles reward hacking and unstable optimization in RLHF by unifying two regularization mechanisms through a dual-KL objective. It introduces an interpolated reference target $\pi_{ref} \propto \pi_0^{\alpha} \pi_t^{1-\alpha}$ and derives a practical DAR algorithm that reframes alignment as a weighted supervised fine-tuning problem with a closed-form optimal policy. Theoretical analysis shows the dual-KL objective dynamically adapts the reference target as learning progresses, expanding the search space beyond the initial policy while maintaining stability. Empirically, DAR outperforms online RLHF and DAP baselines across diverse tasks, achieving superior reward/regularization trade-offs and improved learning stability, with ablations confirming the necessity of the regression-based formulation. The work offers a computationally efficient, simpler alternative to PPO-based RLHF and provides a foundation for broader applications of dual-KL regularization in LLM alignment.
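
The interpolated reference target and the closed-form optimum mentioned above can be illustrated on a toy discrete response space. The following is a minimal sketch, not the paper's DAR implementation: it assumes the dual-KL penalty has the form $\beta[\alpha\,\mathrm{KL}(\pi\|\pi_0) + (1-\alpha)\,\mathrm{KL}(\pi\|\pi_t)]$ and uses the standard result that a KL-regularized objective is maximized by $\pi^* \propto \pi_{ref}\exp(A/\beta)$. All function names, the three-response distributions, and the advantage values are hypothetical.

```python
import math

def interpolated_reference(pi_0, pi_t, alpha):
    """Return (pi_ref, C) with pi_ref proportional to pi_0^alpha * pi_t^(1 - alpha)."""
    unnorm = [p0 ** alpha * pt ** (1.0 - alpha) for p0, pt in zip(pi_0, pi_t)]
    C = sum(unnorm)  # normalizing factor C(x)
    return [u / C for u in unnorm], C

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def closed_form_policy(pi_ref, advantage, beta):
    """Assumed optimum of the KL-regularized objective: pi_ref * exp(A / beta), normalized."""
    logits = [math.log(r) + a / beta for r, a in zip(pi_ref, advantage)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    Z = sum(weights)
    return [w / Z for w in weights]

def dual_kl_objective(pi, pi_0, pi_t, advantage, alpha, beta):
    """Expected advantage minus the (assumed) dual-KL penalty."""
    expected_adv = sum(p * a for p, a in zip(pi, advantage))
    penalty = alpha * kl(pi, pi_0) + (1.0 - alpha) * kl(pi, pi_t)
    return expected_adv - beta * penalty

# Hypothetical toy setup: three candidate responses for a single prompt.
pi_0 = [0.7, 0.2, 0.1]         # initial (SFT) policy
pi_t = [0.3, 0.3, 0.4]         # current policy
advantage = [-1.0, 0.0, 1.0]   # made-up advantage estimates
alpha, beta = 0.5, 1.0

pi_ref, C = interpolated_reference(pi_0, pi_t, alpha)
pi_star = closed_form_policy(pi_ref, advantage, beta)
```

Setting `alpha = 1.0` recovers $\pi_0$ as the reference and `alpha = 0.0` recovers $\pi_t$, so the single coefficient moves the regularization target between the two policies.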

Abstract

Reinforcement Learning from Human Feedback (RLHF) has advanced alignment capabilities significantly but remains hindered by two core challenges: reward hacking and stable optimization. Current solutions address these issues independently through separate regularization strategies, specifically a KL-divergence penalty against a supervised fine-tuned model ($\pi_0$) to mitigate reward hacking, and policy ratio clipping towards the current policy ($\pi_t$) to promote stable alignment. However, the implicit trade-off arising from simultaneously regularizing towards both $\pi_0$ and $\pi_t$ remains under-explored. In this paper, we introduce a unified regularization approach that explicitly balances the objectives of preventing reward hacking and maintaining stable policy updates. Our simple yet principled alignment objective yields a weighted supervised fine-tuning loss with a superior trade-off, which demonstrably improves alignment results while reducing implementation complexity. Extensive experiments across diverse benchmarks validate that our method consistently outperforms RLHF and online preference learning methods, achieving enhanced alignment performance and stability.

Paper Structure

This paper contains 33 sections, 2 theorems, 27 equations, 8 figures, 14 tables, 1 algorithm.

Key Result

Proposition 4.1

The dual-KL advantage maximization objective (eq:dar_dual_obj in the paper) is equivalent to optimizing against an interpolated reference policy in log-space, $\pi_{ref}(y|x) = \pi_0(y|x)^{\alpha} \, \pi_t(y|x)^{1-\alpha} / C(x)$, where $C(x)=\sum_{y}\pi_0(y|x)^{\alpha} \, \pi_t(y|x)^{1-\alpha}$ is the normalizing factor for the effective reference target. The proof is provided in the paper's appendix (sec:pps_proof).
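
The equivalence can be checked with a short derivation. This is a sketch, not the paper's proof: the objective itself is not reproduced on this page, so it assumes the dual-KL penalty is the $\alpha$-weighted sum of the two KL terms, and it writes $\pi_{ref}(y|x) = \pi_0(y|x)^{\alpha}\,\pi_t(y|x)^{1-\alpha}/C(x)$ for the normalized interpolated reference.

$$
\begin{aligned}
\alpha\,\mathrm{KL}(\pi \,\|\, \pi_0) + (1-\alpha)\,\mathrm{KL}(\pi \,\|\, \pi_t)
&= \sum_{y} \pi(y|x)\left[\log \pi(y|x) - \alpha \log \pi_0(y|x) - (1-\alpha)\log \pi_t(y|x)\right] \\
&= \sum_{y} \pi(y|x)\left[\log \pi(y|x) - \log\big(C(x)\,\pi_{ref}(y|x)\big)\right] \\
&= \mathrm{KL}(\pi \,\|\, \pi_{ref}) - \log C(x).
\end{aligned}
$$

Because $\log C(x)$ does not depend on $\pi$, penalizing the weighted pair of KL terms and penalizing the single term $\mathrm{KL}(\pi \,\|\, \pi_{ref})$ yield the same maximizer under this assumed form.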

Figures (8)

  • Figure 1: Dual-KL regularization enables exploration beyond reference policy support. (a) PPO-based RLHF uses policy ratio clipping relative to $\pi_t$ for stable optimization and a KL divergence penalty relative to $\pi_0$ for reference regularization. High-reward regions remain unexplored when they lack sufficient support under the reference policy. (b) Our approach unifies stable optimization and reference regularization, enabling flexible trade-offs between the two mechanisms. This allows the policy to expand into high-reward regions previously inaccessible due to limited reference support, achieving better alignment when substantial behavioral changes are required. (c) Empirical validation on the Anthropic-Helpfulness dataset: incorporating dual-KL penalties in advantage estimation improves the reward-KL Pareto frontier over standard PPO for both Dual-PPO variants.
  • Figure 2: Log-likelihood interpolation creates a reference target that provides better support for the optimal policy distribution.
  • Figure 3: Reference win rate curves of DAR against DAP methods and online RLHF methods. The base policy is Qwen2-7B, and the LLM annotator is Qwen2-72B-Instruct. Win rates are evaluated by GPT-4-Turbo on a random test set of 1,000 examples. Shaded regions indicate 95% confidence intervals across 3 random seeds.
  • Figure 4: Pareto analysis of the reward/KL regularization trade-off, obtained by sweeping $\beta$. Each marker represents a 1k-example evaluation of a fine-tuned Qwen2-7B policy, with Qwen2-72B-Instruct as the annotator. Solid lines show second-order polynomial fits, while dashed lines indicate the peak fitted reward for DAR.
  • Figure 5: DAR vs. RL-based Dual-KL (DAO, Dual-PPO) on (a) TL;DR and (b) Helpfulness. Ablation studies on the Helpfulness task: (c) trade-off coefficient $\alpha$ on alignment results; (d) total regularization coefficient $\beta$ on win rate; (e) N-shot sampling size on win rate for DAR and Monte-Carlo baseline methods; (f) weight clip threshold $w_\text{clip}$ on reward for DAR across three datasets. The shaded areas in (a), (b), and (d) represent the 95% confidence interval over 3 seeds.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Proposition 4.1
  • Theorem 4.2