Gradient Imbalance in Direct Preference Optimization
Qinwei Ma, Jingzhe Shi, Can Jin, Jenq-Neng Hwang, Serge Belongie, Lei Li
TL;DR
The paper identifies gradient imbalance in Direct Preference Optimization (DPO) as a key cause of its inconsistent performance relative to PPO-based RLHF. It provides a theoretical analysis of learning dynamics under imbalanced versus balanced losses and validates the analysis with synthetic simulations and LLM experiments. It proposes Balanced-DPO, a simple gradient-reweighting approach, and shows that it improves alignment with human preferences, increases robustness to distribution shifts, and mitigates out-of-distribution (OOD) overestimation across several settings. The work argues that how updates propagate during training is crucial for pairwise-feedback methods and outlines a direction for making DPO more robust and competitive in real-world tasks.
Abstract
Direct Preference Optimization (DPO) has been proposed as a promising alternative to Proximal Policy Optimization (PPO)-based Reinforcement Learning from Human Feedback (RLHF). However, empirical evaluations consistently show that DPO underperforms common RLHF pipelines. In this work, we conduct a systematic analysis of DPO's training dynamics and identify gradient imbalance as a critical limitation. We demonstrate theoretically and empirically that this imbalance perturbs optimization trajectories, destabilizes learning, and induces suboptimal convergence. To address this issue, we propose Balanced-DPO, a simple yet effective modification to the DPO objective that introduces a computationally efficient gradient reweighting mechanism. Our experiments demonstrate the effectiveness of Balanced-DPO, validate the theoretical findings, and confirm that addressing gradient imbalance is key to improving DPO's performance, which highlights a promising direction for future research.
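To make the DPO objective concrete, the following is a minimal sketch of the standard per-pair DPO loss and its gradient with respect to the chosen and rejected response probabilities. It is only an illustration under stated assumptions: this section does not define gradient imbalance formally or give the Balanced-DPO reweighting, so neither is reproduced here. The function names, the default `beta`, and the choice to differentiate with respect to probabilities (rather than model parameters) are assumptions made for this example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    logp_w / logp_l: policy log-probabilities of the chosen / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as softplus(-margin)
    return math.log1p(math.exp(-margin))


def dpo_grad_wrt_probs(logp_w: float, logp_l: float,
                       ref_logp_w: float, ref_logp_l: float,
                       beta: float = 0.1) -> tuple[float, float]:
    """Gradient of the loss w.r.t. the probabilities pi(y_w) and pi(y_l).

    Illustrative assumption: the paper's own notion of imbalance may be
    defined differently (e.g. at the parameter level).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    scale = beta * sigmoid(-margin)        # shared factor for both terms
    grad_w = -scale / math.exp(logp_w)     # pushes pi(y_w) up
    grad_l = scale / math.exp(logp_l)      # pushes pi(y_l) down
    return grad_w, grad_l
```

In this sketch the two gradient magnitudes differ by the factor pi(y_w) / pi(y_l): for example, with pi(y_w) = 0.2 and pi(y_l) = 0.01, the rejected-response term is 20 times larger than the chosen-response term. This is one simple way an asymmetry between the two sides of a preference pair can arise; it should not be read as the paper's analysis.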
