Gradient Imbalance in Direct Preference Optimization

Qinwei Ma, Jingzhe Shi, Can Jin, Jenq-Neng Hwang, Serge Belongie, Lei Li

TL;DR

The paper identifies gradient imbalance in Direct Preference Optimization (DPO) as a critical limitation behind its inconsistent performance relative to PPO-based RLHF. It provides a theoretical analysis of learning dynamics under imbalanced versus balanced losses and validates the analysis with synthetic simulations and LLM experiments. It then proposes Balanced-DPO, a simple gradient-reweighting modification, and reports that it improves alignment with human preferences, improves robustness to distribution shift, and mitigates overestimation of out-of-distribution (OOD) responses. The work argues that how updates propagate during training is crucial for pairwise-feedback methods and outlines a direction for making DPO more robust and competitive in real-world tasks.

Abstract

Direct Preference Optimization (DPO) has been proposed as a promising alternative to Proximal Policy Optimization (PPO) based Reinforcement Learning with Human Feedback (RLHF). However, empirical evaluations consistently reveal suboptimal performance in DPO compared to common RLHF pipelines. In this work, we conduct a systematic analysis of DPO's training dynamics and identify gradient imbalance as a critical limitation. We demonstrate theoretically and empirically that this imbalance perturbs optimization trajectories, destabilizes learning, and induces suboptimal convergence. To address this issue, we propose Balanced-DPO, a simple yet effective modification to the DPO objective that introduces a computationally efficient gradient reweighting mechanism. Our experiments demonstrate the effectiveness of Balanced-DPO, validating the theoretical findings and confirming that addressing gradient imbalance is key to improving DPO's performance, highlighting a promising direction for future research.
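
For readers who want the baseline objective in front of them, the following is a minimal sketch of the standard DPO loss for a single preference pair, together with its gradient with respect to the two policy log-probabilities. It follows the widely used DPO formulation, not Balanced-DPO: the reweighting mechanism itself is not specified in this summary, so it is not reproduced here. The function names, argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_margin(logp_chosen, logp_rejected,
               ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Scaled difference of policy-vs-reference log-ratios (hypothetical helper)."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return beta * (chosen_ratio - rejected_ratio)


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair: -log sigmoid(margin)."""
    m = dpo_margin(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected, beta)
    return -log_sigmoid(m)


def dpo_logprob_grads(logp_chosen, logp_rejected,
                      ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Gradient of the loss w.r.t. (logp_chosen, logp_rejected).

    d/dm [-log sigmoid(m)] = -sigmoid(-m), and m depends on the two
    log-probabilities with coefficients +beta and -beta respectively.
    """
    m = dpo_margin(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected, beta)
    scale = beta * math.exp(log_sigmoid(-m))  # beta * sigmoid(-m)
    return -scale, scale
```

In log-probability space the two gradients have equal magnitude and opposite sign. Any imbalance the paper analyzes therefore has to be read from its own definitions (Definition 1 and Corollary 1), which this sketch does not attempt to encode.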

Paper Structure

This paper contains 37 sections, 16 theorems, 55 equations, 2 figures, 3 tables.

Key Result

Corollary 1

Standard DPO loss is negatively imbalanced.

Figures (2)

  • Figure 1: The distribution of $w_i$ (for DPO) and $w_i\pi_\theta(y_i\mid x)$ (for PPO) in different scenarios. In each figure, the red curve is for PPO, and the blue curve is for DPO; the two colored regions show the $\mu$ and $\sigma$ for both P and Q distributions, where the green region stands for P and the violet region stands for Q. The images in the left column use uniform sampling, while the images in the right column sample according to the current model distribution, which is to simulate the case without any distribution shift. In all figures, the response set is $\mathcal{A}=\{0,1,\cdots,99\}$. In the first row, $\mu_P=45,~\mu_Q=55$; in the second row, $\mu_P=30,~\mu_Q=70$; in the third row, $\mu_P=48, \mu_Q=52$; in all figures, $\sigma_P^2=\sigma_Q^2=100$.
  • Figure 2: Probability distribution of the model output for DPO. Lighter colors indicate higher probability; darker colors indicate lower probability. This result is from the $\text{mask}=0.2$ case.

Theorems & Definitions (27)

  • Definition 1
  • Corollary 1
  • Corollary 2
  • Corollary 3
  • Lemma 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Lemma 2
  • Lemma 3
  • ...and 17 more