A Survey of Direct Preference Optimization

Shunyu Liu, Wenkai Fang, Zetian Hu, Junjie Zhang, Yang Zhou, Kongcheng Zhang, Rongcheng Tu, Ting-En Lin, Fei Huang, Mingli Song, Yongbin Li, Dacheng Tao

TL;DR

The paper surveys Direct Preference Optimization (DPO) as a scalable alternative to RLHF for aligning large language models with human preferences. It introduces a four-dimensional taxonomy (data strategy, learning framework, constraint mechanism, and model property) to organize DPO variants, and reports an empirical analysis of those variants on standardized benchmarks. The authors discuss real-world applications in LLMs, diffusion models, and multi-modal systems, and outline future directions in efficiency, multi-modal alignment, continuous adaptation, and interpretability. By consolidating theoretical foundations, empirical findings, and practical guidance, the work aims to support robust and generalizable preference-alignment methods.

Abstract

Large Language Models (LLMs) have demonstrated unprecedented generative capabilities, yet their alignment with human values remains critical for ensuring helpful and harmless deployments. While Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful paradigm for aligning LLMs with human preferences, its reliance on complex reward modeling introduces inherent trade-offs in computational efficiency and training stability. In this context, Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative that directly optimizes LLMs using human preferences, thereby circumventing the need for explicit reward modeling. Owing to its theoretical elegance and computational efficiency, DPO has rapidly attracted substantial research efforts exploring its various implementations and applications. However, this field currently lacks systematic organization and comparative analysis. In this survey, we conduct a comprehensive overview of DPO and introduce a novel taxonomy, categorizing previous works into four key dimensions: data strategy, learning framework, constraint mechanism, and model property. We further present a rigorous empirical analysis of DPO variants across standardized benchmarks. Additionally, we discuss real-world applications, open challenges, and future directions for DPO. This work delivers both a conceptual framework for understanding DPO and practical guidance for practitioners, aiming to advance robust and generalizable alignment paradigms. All collected resources are available and will be continuously updated at https://github.com/liushunyu/awesome-direct-preference-optimization.
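To make concrete what "directly optimizes LLMs using human preferences" means, here is a minimal sketch of the standard per-pair DPO objective, -log σ(β[(log π_θ(y_w|x) - log π_ref(y_w|x)) - (log π_θ(y_l|x) - log π_ref(y_l|x))]). The formula is the commonly cited DPO loss, not something quoted from this page; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be already-summed sequence log-probabilities.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; controls deviation from the reference
) -> float:
    """Per-pair DPO loss from sequence log-probabilities (hypothetical helper)."""
    # Implicit rewards are beta-scaled log-ratios of policy to reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return _softplus(-margin)


# When the policy equals the reference, the margin is zero and the loss is log 2.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen log-probability and lowering the rejected one reduces the loss.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

No separate reward model appears anywhere in this computation: the preference signal enters only through the log-probability ratios against a frozen reference model, which is the sense in which explicit reward modeling is circumvented.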

Paper Structure

This paper contains 38 sections, 11 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: A taxonomy of DPO. We categorize existing DPO works into four branches: data strategy, learning framework, constraint mechanism, and model property. Different colored boxes indicate different categories and their corresponding representative references.
  • Figure 2: An overview of DPO data strategy.
  • Figure 3: An overview of DPO learning framework.
  • Figure 4: An overview of DPO constraint mechanism.
  • Figure 5: An overview of DPO model property.