KL Penalty Control via Perturbation for Direct Preference Optimization

Sangkyu Lee, Janghoon Han, Hosung Song, Stanley Jungkyu Choi, Honglak Lee, Youngjae Yu

TL;DR

ε-Direct Preference Optimization (ε-DPO) introduces instance-level adaptive KL penalty control for direct preference optimization by perturbing the KL penalty strength $\beta$ and checking whether the logit-based preference confidence changes monotonically. The method estimates the perturbed policies by reusing logits from the current and reference policies, enabling per-instance adjustment without extra forward passes and achieving a more efficient KL trade-off than approaches that control $\beta$ at the batch level or update the reference policy periodically. Empirical results on UltraFeedback and Anthropic-HH show ε-DPO outperforming DPO and most direct alignment baselines, while its control criterion signals when the policy, viewed as a preference model, is confused about a preference pair, at modest additional computational cost. The work highlights the importance of instance-level KL penalty relaxation for data-dependent alignment of large language models with human preferences, though it depends on maintaining a reference policy and incurs some memory cost.

Abstract

Direct Preference Optimization (DPO) demonstrates the advantage of aligning a large language model with human preference using only an offline dataset. However, DPO has the limitation that the KL penalty, which prevents excessive deviation from the reference model, is static throughout the training process. Several methods claim to change this static KL penalty of DPO into a dynamic one, but no approach can adaptively assign different KL penalties for each preference pair. In this paper, we propose $\varepsilon$-Direct Preference Optimization ($\varepsilon$-DPO), which allows adaptive control of the KL penalty strength $\beta$ for each preference pair. Specifically, $\varepsilon$-DPO adaptively controls $\beta$ for each preference pair based on the monotonicity of logits as a preference model under the perturbation of $\beta$ during training. This is equivalent to adjusting the KL penalty by checking whether the change in training-time temperature can lead to better preference confidence as preference models by simply reusing the logit of the current policy and the reference policy. Experimental results show that the simple criterion of $\varepsilon$-DPO for KL penalty relaxation significantly improves DPO compared to most existing direct alignment algorithms on general chatbot benchmarks and reveal that this KL penalty control criterion can reflect confusion as a preference model and provide an efficient KL trade-off, highlighting the significance of instance-level adaptive KL penalty control in DPO.
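
The per-pair control of $\beta$ described in the abstract can be pictured with the standard DPO loss, where each preference pair carries its own $\beta$ instead of sharing one global value. The following is a minimal Python sketch under that reading, not the authors' implementation; the function names `dpo_pair_loss` and `mean_dpo_loss` and the toy log-probabilities are hypothetical.

```python
import math


def dpo_pair_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta):
    """Standard DPO loss for one preference pair.

    The first four arguments are sequence log-probabilities of the chosen and
    rejected responses under the policy and the reference policy; beta is the
    KL penalty strength used for this pair only.
    """
    # Implicit reward margin: difference of the policy/reference log-ratios.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def mean_dpo_loss(pairs, betas):
    """Average the per-pair losses, each computed with its own beta."""
    losses = [dpo_pair_loss(*pair, beta) for pair, beta in zip(pairs, betas)]
    return sum(losses) / len(losses)


# Toy values (hypothetical): (pol_chosen, pol_rejected, ref_chosen, ref_rejected)
pairs = [(-12.0, -15.0, -13.0, -14.0), (-20.0, -19.0, -20.5, -20.0)]
betas = [0.1, 0.05]  # one KL penalty strength per preference pair
loss = mean_dpo_loss(pairs, betas)
```

With a single shared $\beta$ the list `betas` would hold the same value for every pair, which recovers the static KL penalty of plain DPO.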

Paper Structure

This paper contains 33 sections, 2 theorems, 19 equations, 5 figures, 11 tables, 1 algorithm.

Key Result

Proposition 1

Under the assumption of optimal autoregressive policy $\pi^*$ where the prompt $x \in \mathcal{X}$, response vocabulary $y_i \in \mathcal{V}$, and logit $f: \mathcal{X} \times \mathcal{V}^{i-1} \rightarrow \mathbb{R}^{|\mathcal{V}|}$, the optimal policy $\pi^*_{\frac{\beta}{\lambda}}$ can be approximated …

Proof. See the appendix (app:dera).
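
The statement of the proposition is cut off in this copy, so the exact approximation is not recoverable from the text here. As a hedged illustration only, the sketch below assumes the linear logit interpolation from the decoding-time realignment work of Liu et al. (2024), to which the proposition is attributed: the next-token distribution for $\beta/\lambda$ is estimated as a softmax over $\lambda f_\theta + (1-\lambda) f_{\text{ref}}$. The function names and toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def perturbed_log_probs(policy_logits, ref_logits, lam):
    """Estimate next-token log-probs of the policy for beta / lam.

    Assumption (not stated in the truncated text above): the estimate mixes
    the current-policy and reference logits linearly with weight lam.
    """
    mixed = [lam * p + (1.0 - lam) * r for p, r in zip(policy_logits, ref_logits)]
    return log_softmax(mixed)


# Toy next-token logits over a three-token vocabulary (hypothetical values).
policy_logits = [2.0, 0.5, -1.0]
ref_logits = [0.0, 0.0, 0.0]
stronger = perturbed_log_probs(policy_logits, ref_logits, 1.1)  # smaller effective beta
weaker = perturbed_log_probs(policy_logits, ref_logits, 0.9)    # larger effective beta
```

Under this assumption, `lam = 1` returns the current policy and `lam = 0` returns the reference policy, and both sets of logits are already available from the DPO forward passes.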

Figures (5)

  • Figure 1: $\varepsilon$-DPO adaptively controls $\beta$, the KL penalty strength, for each preference pair. It checks whether the log-likelihood ratio between the chosen and rejected responses changes monotonically when the training-time $\beta$ is perturbed, estimating the perturbed policies by reusing logits.
  • Figure 2: Comparison between $\varepsilon$-DPO and existing KL penalty relaxation methods for DPO, $\beta$-DPO (Wu et al., 2024) and TR-DPO (Gorbatovski et al., 2024). Only $\varepsilon$-DPO achieves instance-level KL penalty relaxation compared to other methods, which control $\beta$ at batch-level or update the reference policy periodically.
  • Figure 3: Intra-epoch training dynamics of Llama-3-Instruct according to the change of $\varepsilon$. We additionally plot the fitted curves of AlpacaEval 2 LC results of each checkpoint and exponential moving average lines of the in-batch occurrence ratio on $\beta^-_\varepsilon$ and $\beta^+_\varepsilon$ for better visual representation.
  • Figure 4: (a) Implicit reward margin of pairs showing logit monotonicity in policies trained with DPO under various $\beta$. Each error bar indicates the 0.95 confidence interval. (b) Pareto frontier between KL divergence and win rate, which is measured by comparing with chosen responses in the test split.
  • Figure 5: Changes of upper bound of $\varepsilon$ consistently satisfying the monotonically decreasing or increasing criterion with the 0.95 confidence band.
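
The monotonicity check described in the Figure 1 caption can be sketched as a three-way decision per preference pair. This is an illustrative sketch, not the paper's algorithm: it assumes the perturbed values are $\beta^-_\varepsilon = \beta/(1+\varepsilon)$ and $\beta^+_\varepsilon = \beta/(1-\varepsilon)$, and that a log-likelihood ratio that grows as $\beta$ shrinks favors the smaller value. Neither detail is spelled out in this summary, and `select_beta` is a hypothetical name.

```python
def select_beta(ratio_minus, ratio_base, ratio_plus, beta, eps):
    """Pick the KL penalty strength for one preference pair.

    Each ratio_* is log pi(chosen | x) - log pi(rejected | x) under the policy
    estimated for beta_minus, beta, and beta_plus respectively.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    beta_minus = beta / (1.0 + eps)  # assumed form of the smaller perturbed beta
    beta_plus = beta / (1.0 - eps)   # assumed form of the larger perturbed beta
    if ratio_minus > ratio_base > ratio_plus:
        # Confidence rises monotonically as beta decreases: relax the penalty.
        return beta_minus
    if ratio_minus < ratio_base < ratio_plus:
        # Confidence rises monotonically as beta increases: strengthen the penalty.
        return beta_plus
    # No monotonic trend: keep the current beta for this pair.
    return beta


# Toy log-likelihood ratios for three pairs (hypothetical values).
chosen_betas = [
    select_beta(2.0, 1.0, 0.5, beta=0.1, eps=0.01),
    select_beta(0.5, 1.0, 2.0, beta=0.1, eps=0.01),
    select_beta(1.0, 2.0, 1.0, beta=0.1, eps=0.01),
]
```

Because the decision is made independently for each pair, different pairs in the same batch can end up with different $\beta$ values, which is the instance-level behavior that Figure 2 contrasts with batch-level control.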

Theorems & Definitions (2)

  • Proposition 1 (Liu et al., 2024)
  • Proposition 1 (Liu et al., 2024)