Reveal the Mystery of DPO: The Connection between DPO and RL Algorithms
Xuerui Su, Yue Wang, Jinhua Zhu, Mingyang Yi, Feng Xu, Zhiming Ma, Yuting Liu
TL;DR
The paper addresses the theoretical ambiguity surrounding Direct Preference Optimization (DPO) and its relation to Reinforcement Learning from Human Feedback (RLHF). It introduces the UDRRA framework, which unifies four loss-construction scenarios: Boltzmann distribution approximation (BDA), reward approximation (RA), reward-difference approximation (RDA), and preference-reward approximation (PRA). It shows that all four converge to the same Boltzmann target distribution $\pi^\tau$, with $\tau$ controlling the trade-off toward the optimal policy $\pi^\delta$. The work proves target-distribution equivalences among BDA, RA, RDA, and PRA (and their posterior variants), and analyzes the distribution shift between DPO and PRA-P when using offline data, highlighting the impact of $\tau$ and data design on the convergence rate. It further discusses data-selection strategies and theoretical bounds that guide how to accelerate learning in RLHF while managing distribution shift. Collectively, these findings provide a principled lens to compare, deploy, and improve RLHF methods, clarifying when DPO can match or exceed PPO-based approaches and how to mitigate offline-data pitfalls.
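To make the role of the temperature concrete, the following minimal sketch computes a generic Boltzmann (softmax) distribution over a handful of candidate responses. It assumes the textbook form, with probabilities proportional to $\exp(r/\tau)$; the paper's exact definition of $\pi^\tau$ (for example, whether a reference policy enters as a prior) is not given in this summary. The function name `boltzmann` and the reward values are hypothetical and chosen only for illustration.

```python
import math

def boltzmann(rewards, tau):
    """Return softmax(rewards / tau) as a list of probabilities.

    Assumption: the generic Boltzmann form p_i proportional to exp(r_i / tau).
    This is an illustrative stand-in, not the paper's exact pi^tau.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [r / tau for r in rewards]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical rewards for three candidate responses.
rewards = [1.0, 2.0, 3.0]
for tau in (10.0, 1.0, 0.1):
    print(tau, [round(p, 3) for p in boltzmann(rewards, tau)])
```

Under this assumed form, a large $\tau$ yields a nearly uniform distribution, while a small $\tau$ concentrates almost all mass on the highest-reward response, which is the sense in which $\tau$ trades off toward a greedy optimal policy.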
Abstract
With the rapid development of Large Language Models (LLMs), numerous Reinforcement Learning from Human Feedback (RLHF) algorithms have been introduced to improve model safety and alignment with human preferences. These algorithms can be divided into two main frameworks based on whether they require an explicit reward (or value) function for training: actor-critic-based Proximal Policy Optimization (PPO) and alignment-based Direct Preference Optimization (DPO). The mismatch between DPO and PPO, such as DPO's use of a classification loss driven by human-preferred data, has raised confusion about whether DPO should be classified as a Reinforcement Learning (RL) algorithm. To address these ambiguities, we focus on three key aspects related to DPO, RL, and other RLHF algorithms: (1) the construction of the loss function; (2) the target distribution to which the algorithm converges; (3) the impact of key components within the loss function. Specifically, we first establish a unified framework named UDRRA connecting these algorithms based on the construction of their loss functions. Next, we uncover their target policy distributions within this framework. Finally, we investigate the critical components of DPO to understand their impact on the convergence rate. Our work provides a deeper understanding of the relationship between DPO, RL, and other RLHF algorithms, offering new insights for improving existing algorithms.
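For readers unfamiliar with the "classification loss" mentioned above, the sketch below evaluates the standard DPO objective for a single preference pair, $-\log \sigma\big(\beta[(\log \pi(y_w|x) - \log \pi_{\mathrm{ref}}(y_w|x)) - (\log \pi(y_l|x) - \log \pi_{\mathrm{ref}}(y_l|x))]\big)$. This is the widely used form from the original DPO formulation, not something derived in this abstract. The function name `dpo_loss`, its argument names, and the default `beta` are illustrative assumptions, and how $\beta$ relates to the paper's $\tau$ is not stated here.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) response pair.

    Inputs are sequence log-probabilities under the trained policy
    (logp_*) and a frozen reference policy (ref_logp_*). All names
    and the default beta are assumptions made for illustration.
    """
    # Implicit reward margin between the preferred and rejected responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) computed stably as softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities: the policy already favors the preferred
# response slightly more than the reference policy does.
print(dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.0))
```

The loss is a binary logistic loss on the margin: it equals $\log 2$ when the policy and reference agree on the pair, and it decreases as the policy raises the preferred response relative to the rejected one.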
