What Is Preference Optimization Doing, How and Why?

Yue Wang, Qizhou Wang, Zizhuo Zhang, Ang Li, Gang Niu, Bo Han, Masashi Sugiyama

TL;DR

The paper investigates why direct preference optimization (DPO) often behaves like supervised learning while proximal policy optimization (PPO) exhibits reinforcement-learning-like dynamics in preference optimization (PO) for LLM alignment. By introducing a gradient-alignment framework, it decomposes PO into positive learning, negative learning, and loss reweighting, and analyzes their distinct roles. Empirical results show that DPO yields stable targets whereas PPO promotes exploration, with loss reweighting serving method-specific functions and negative learning sometimes acting as a regularizer. The authors further demonstrate that coordinating the learning components (via cDPO, cPPO, and hPPO) can improve performance, and they propose Coordinated Preference Optimization as a future direction for optimizing these dynamics in real time.

Abstract

Preference optimization (PO) is indispensable for large language models (LLMs), with methods such as direct preference optimization (DPO) and proximal policy optimization (PPO) achieving great success. A common belief is that DPO is supervised learning while PPO is reinforcement learning, yet deeper analyses of the reasons underlying these differences remain lacking. To fill this gap, we analyze their optimization dynamics, revealing distinct algorithmic behaviors and explaining their underlying causes. First, we examine the target directions of gradient-based updates and find that DPO follows stable targets, whereas PPO follows dynamic targets that balance exploration and exploitation, thus validating the common belief from a new perspective. Second, we examine the roles of positive learning, negative learning, and loss reweighting, which are three key components in PO methods. Our analyses reveal that these components play fairly different roles. In DPO, positive and negative learning jointly shape the learning targets while offsetting each other. Loss reweighting in DPO, however, acts less as a reward signal and more as a regularizer that mitigates overfitting. In PPO, negative learning primarily supports exploration rather than determining the targets. Meanwhile, loss reweighting, which is related to the absolute values of token-level advantages, indicates the distinct roles of token groups in updating targets. Given these findings, we conduct carefully designed ablation studies to further examine how controlling these dynamics impacts optimization efficiency and practical performance. The insights gained from our analyses not only deepen the understanding of PO methods but also inspire the development of more preference-aligned LLMs.
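
To make the three components concrete, the following minimal sketch uses the standard sequence-level DPO loss, $-\log\sigma(u)$ with $u=\beta\,[(\log\pi_\theta(y_w|x)-\log\pi_{\mathrm{ref}}(y_w|x))-(\log\pi_\theta(y_l|x)-\log\pi_{\mathrm{ref}}(y_l|x))]$, and splits its gradient into a per-pair weight $\sigma(-u)$, a term that raises the chosen log-probability (positive learning), and a term that lowers the rejected one (negative learning). The function names, the scalar log-probability inputs, and $\beta=0.1$ are illustrative assumptions; the paper's own definitions of $\mathcal{L}_{\mathrm{dpo}}^{+}$, $\mathcal{L}_{\mathrm{dpo}}^{-}$, and the weighted groups may differ in detail.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function; adequate for the moderate values used in this sketch."""
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l: policy log-probabilities of the chosen / rejected response.
    ref_w / ref_l:   reference-model log-probabilities of the same responses.
    beta:            assumed temperature (0.1 is an illustrative default).
    """
    u = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(sigmoid(u))


def dpo_gradient_parts(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """Split dL/d(logp) into a reweighting factor and its two directions."""
    u = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    weight = sigmoid(-u)  # loss reweighting: large when the pair is ranked wrongly
    return {
        "weight": weight,
        "d_chosen": -beta * weight,   # positive learning: descent raises logp_w
        "d_rejected": beta * weight,  # negative learning: descent lowers logp_l
    }


if __name__ == "__main__":
    # Hypothetical log-probabilities for a single (chosen, rejected) pair.
    print(dpo_gradient_parts(-12.0, -15.0, -13.0, -14.0))
```

In this example the implicit reward margin is $u=0.2$, so the weight is about $0.45$. The two directions always carry equal and opposite coefficients here, which is one simple way to see how positive and negative learning can offset each other while sharing a single weight.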

Paper Structure

This paper contains 21 sections, 28 equations, 19 figures, 1 table.

Figures (19)

  • Figure 1: DPO Learning Dynamics. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we show the dynamics of $\mathcal{G}$ measured per $1000$ training steps: (a) the overall objective $\mathcal{L}_{\mathrm{dpo}}$ (TOT); (b) the positive $\mathcal{L}_{\mathrm{dpo}}^{+}$ (POS) and negative $\mathcal{L}_{\mathrm{dpo}}^{-}$ (NEG) components; and (c) the weighted top $\mathcal{L}^{\uparrow}_{\mathrm{dpo}}$ (TOP), middle $\mathcal{L}^{\rightarrow}_{\mathrm{dpo}}$ (MID), and bottom $\mathcal{L}^{\downarrow}_{\mathrm{dpo}}$ (BOT) components. The log scale is used for $\mathcal{G}$ due to its span across several orders of magnitude.
  • Figure 2: PPO Learning Dynamics. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we show the dynamics of $\mathcal{G}$ measured per $400$ training steps: (a) the overall objective $\mathcal{L}_{\mathrm{ppo}}$ (TOT); (b) the positive $\mathcal{L}_{\mathrm{ppo}}^{+}$ (POS) and negative $\mathcal{L}_{\mathrm{ppo}}^{-}$ (NEG) components; and (c) the weighted top $\mathcal{L}^{\uparrow}_{\mathrm{ppo}}$ (TOP), middle $\mathcal{L}^{\rightarrow}_{\mathrm{ppo}}$ (MID), and bottom $\mathcal{L}^{\downarrow}_{\mathrm{ppo}}$ (BOT) components. The log scale is used for $\mathcal{G}$ to align with Figure 1.
  • Figure 3: Average (Raw) Advantages during PPO for top (TOP), middle (MID), and bottom (BOT) weighted data.
  • Figure 4: Performance under Ablation. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we show performance measured by Win Rate under: (a) DPO ablations removing negative learning (w/o NEG) before 3000 steps and positive learning (w/o POS) after 6000 steps; (b) PPO ablations removing top (w/o TOP) and middle (w/o MID) weighted data; (c) cDPO, which emphasizes positive learning early and negative learning later to balance the two, where Case 1, Case 2, and Case 3 apply the cDPO coordination between training steps $(t_1,t_2)=(2500,5500)$, $(2500,6500)$, and $(3000,8000)$, respectively; (d) cPPO that downweights top data, where Case 1, Case 2, and Case 3 apply coefficients $\lambda=0.3$, $\lambda=0.7$, and $\lambda=0.9$ to the top-weighted samples, respectively; (e) cPPO that downweights middle data, where Case 1, Case 2, and Case 3 apply coefficients $\lambda=0.7$, $\lambda=0.5$, and $\lambda=0.9$ to the middle-weighted samples, respectively; and (f) hPPO, which changes learning behaviors periodically, where Case 1, Case 2, and Case 3 vary the period $t_3$ and amplitude $\tau$ as $(t_3,\tau)=(2,0.08)$, $(5,0.05)$, and $(20,0.01)$, respectively. We present illustrative results here, with additional results and best performance in the appendix.
  • Figure 5: Gradient Magnitudes. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we illustrate the distributions of gradient magnitudes computed with respect to mini-batches for DPO and PPO, across training steps. Normal data points are colored in blue, while outliers detected by IQR are colored in red.
  • ...and 14 more figures
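
The Figure 5 caption marks outlying gradient magnitudes using the interquartile range (IQR). A minimal sketch of that kind of rule is given below, assuming the conventional Tukey fences at $1.5\times$ IQR beyond the quartiles; the caption does not state the multiplier or the quantile method, so `k=1.5`, the inclusive quartile method, the function name, and the sample values are all illustrative assumptions.

```python
import statistics


def iqr_outlier_mask(values, k=1.5):
    """Return a list of booleans marking values outside the IQR fences.

    values: per-mini-batch gradient magnitudes (any sequence of floats).
    k:      fence multiplier; 1.5 is the common convention, assumed here.
    """
    # Quartiles via linear interpolation over the observed range.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


if __name__ == "__main__":
    # Hypothetical gradient magnitudes with one obvious spike.
    magnitudes = [1.0, 1.1, 0.9, 1.2, 1.0, 25.0]
    print(iqr_outlier_mask(magnitudes))
```

With these sample values only the final entry falls outside the fences, which corresponds to the kind of point the caption describes as colored red.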