3D-Properties: Identifying Challenges in DPO and Charting a Path Forward

Yuzi Yan, Yibo Miao, Jialian Li, Yipin Zhang, Jian Xie, Zhijie Deng, Dong Yan

TL;DR

This work investigates why Direct Preference Optimization (DPO), which trains without a reward model (RM), struggles to match RM-based RLHF with PPO. It identifies three instability phenomena, the 3D-properties, that arise from the interaction between the gradients of chosen and rejected responses, and validates them on a toy model and on real LLM tasks in math and instruction following. It then proposes two regularization approaches, Flex-DPO (with adaptive beta) and SFT-DPO, that mitigate these instabilities and improve training stability, especially with on-policy data. By contrasting DPO with RM-based alignment, the authors explain the performance gap and offer concrete guidance for bringing reward-model-free preference learning closer to PPO-based methods. The results give practical insight into optimization dynamics and data distribution effects, and point to directions for future research on alignment without explicit reward models.

Abstract

Aligning large language models (LLMs) with human preferences has gained significant attention, with Proximal Policy Optimization (PPO) as a standard yet computationally expensive method and Direct Preference Optimization (DPO) as a more efficient alternative. While DPO offers simplicity, it remains underutilized in state-of-the-art LLMs, suggesting potential limitations. In this work, we revisit DPO, analyzing its theoretical foundations and empirical performance to bridge this gap. We identify three key properties, termed 3D properties, that emerge from DPO's learning process: Drastic drop in rejected response likelihood, Degradation into response suppression, and Dispersion effect on unseen responses. We show that these issues arise from DPO's optimization dynamics, where the interaction between chosen and rejected response gradients leads to instability. Our findings are supported by experiments on both a controlled toy model and real-world LLM tasks, including mathematical problem-solving and instruction following. To address these challenges, we propose simple regularization techniques that improve training stability and performance. Additionally, we examine how preference data distribution impacts DPO's effectiveness, offering insights into how alignment models handle out-of-domain (OOD) data. Our work connects these observations to broader research and provides a theoretical explanation for DPO's limitations. We hope these insights will guide future advancements in reward-model-free preference learning, bringing it closer to reward-model-based approaches.

Paper Structure

This paper contains 36 sections, 5 theorems, 46 equations, 9 figures, 9 tables.

Key Result

Corollary 1

The ratio of the gradient with respect to the rejected response likelihood $\pi^-$ to the gradient with respect to the chosen response likelihood $\pi^+$ is equal, in magnitude, to the ratio of $\pi^+$ to $\pi^-$:

$$\left|\frac{\partial \ell^{DPO}/\partial \pi^-}{\partial \ell^{DPO}/\partial \pi^+}\right| = \frac{\pi^+}{\pi^-},$$

which indicates that as $\pi^+$ increases and $\pi^-$ decreases, the gradient with respect to $\pi^-$ grows faster, leading to a more rapid decline in the likelihood of the rejected response.
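Corollary 1 can be checked numerically. The sketch below is illustrative rather than the authors' code: it assumes the standard per-pair DPO loss $-\log\sigma\big(\beta[\log(\pi^+/\pi^+_{ref}) - \log(\pi^-/\pi^-_{ref})]\big)$, treats $\pi^+$ and $\pi^-$ as free scalars, and uses hypothetical names (`dpo_loss`, `grad_ratio`) and arbitrary defaults (reference likelihoods of 0.5, $\beta = 0.1$). It estimates both partial derivatives by central finite differences and compares the magnitude of their ratio with $\pi^+/\pi^-$.

```python
import math


def dpo_loss(pi_pos, pi_neg, ref_pos=0.5, ref_neg=0.5, beta=0.1):
    """Per-pair DPO loss as a function of the chosen/rejected likelihoods.

    ref_pos, ref_neg and beta are assumed values chosen for illustration.
    """
    margin = beta * (math.log(pi_pos / ref_pos) - math.log(pi_neg / ref_neg))
    sigmoid = 1.0 / (1.0 + math.exp(-margin))
    return -math.log(sigmoid)


def grad_ratio(pi_pos, pi_neg, eps=1e-6, **kwargs):
    """|d loss / d pi_neg| divided by |d loss / d pi_pos|, via central differences."""
    g_pos = (dpo_loss(pi_pos + eps, pi_neg, **kwargs)
             - dpo_loss(pi_pos - eps, pi_neg, **kwargs)) / (2 * eps)
    g_neg = (dpo_loss(pi_pos, pi_neg + eps, **kwargs)
             - dpo_loss(pi_pos, pi_neg - eps, **kwargs)) / (2 * eps)
    return abs(g_neg / g_pos)


if __name__ == "__main__":
    # As pi_pos rises and pi_neg falls, the numerical ratio tracks pi_pos / pi_neg.
    for pi_pos, pi_neg in [(0.3, 0.3), (0.6, 0.1), (0.9, 0.01)]:
        print(pi_pos, pi_neg, grad_ratio(pi_pos, pi_neg), pi_pos / pi_neg)
```

Under these assumptions the ratio does not depend on $\beta$ or on the reference likelihoods, because both partial derivatives share the same factor $\beta\,(1 - \sigma(\cdot))$ and differ only by $1/\pi^+$ versus $1/\pi^-$.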

Figures (9)

  • Figure 1: Toy model setup. Top left: the optimal policy where the highlighted blocks represent optimal responses. Top right: preference dataset construction. Lower left: the initialization of the SFT model. Lower right: policy output after DPO training.
  • Figure 2: Dynamic optimization process with vanilla DPO using the toy model. Left: likelihood dynamics over training epochs. The blue curve represents the average likelihood of chosen responses, yellow shows the minimum for chosen responses, green represents the average for rejected responses, red shows the maximum for rejected responses, and purple represents the average for unseen responses. Middle: dynamics of averaged $\frac{\partial \ell^{DPO}}{\partial \pi^+}$ and $\frac{\partial \ell^{DPO}}{\partial \pi^-}$ over training epochs. Right: likelihood dynamics over training epochs on a log scale, highlighting the drastic drop in the likelihood of rejected responses.
  • Figure 3: From left to right, the figures show the initial state and the likelihood dynamics for chosen/rejected/unseen responses in Scenarios 1 to 4, similar to the left diagram in Figure 2: (1) both chosen and rejected responses are on-policy, (2) chosen off-policy and rejected on-policy, (3) chosen on-policy and rejected off-policy, and (4) both off-policy.
  • Figure 4: Performance on poem generation, $\beta^-$ varying with $\beta^+=0.1$.
  • Figure 5: Accuracy of RM and DPO on HH-rlhf eval set over the training process.
  • ...and 4 more figures

Theorems & Definitions (6)

  • Corollary 1: Explanation for the property "Drastic drop in rejected response likelihood"
  • Corollary 2: Explanation for the property "Degradation into response suppression" (LLM unlearning)
  • Corollary 3: Explanation for the property "Dispersion effect on unseen responses"
  • Proposition 1
  • Lemma 1
  • Remark 1