DeDPO: Debiased Direct Preference Optimization for Diffusion Models

Khiem Pham; Quang Nguyen; Tung Nguyen; Jingsen Zhu; Michele Santacatterina; Dimitris Metaxas; Ramin Zabih

TL;DR

DeDPO tackles the scalability bottleneck of preference-based alignment by integrating a debiased, doubly robust estimator into Direct Preference Optimization for diffusion models. It combines a small set of human preferences with a large pool of inexpensive synthetic feedback from pretrained AI annotators, while keeping training fully offline. Theoretical guarantees establish that the loss is unbiased and that DeDPO converges quickly even when the synthetic preference labels converge slowly. Experiments show alignment that is competitive with or better than fully human-labeled baselines across backbones and synthetic label sources. This offers a scalable, robust path to human-AI alignment using inexpensive synthetic supervision.

Abstract

Direct Preference Optimization (DPO) has emerged as a predominant alignment method for diffusion models, facilitating off-policy training without explicit reward modeling. However, its reliance on large-scale, high-quality human preference labels presents a severe cost and scalability bottleneck. To overcome this, we propose a semi-supervised framework that augments limited human data with a large corpus of unlabeled pairs annotated via cost-effective synthetic AI feedback. Our paper introduces Debiased DPO (DeDPO), which integrates a debiased estimation technique from causal inference into the DPO objective. By explicitly identifying and correcting the systematic bias and noise inherent in synthetic annotators, DeDPO ensures robust learning from imperfect feedback sources, including self-training and Vision-Language Models (VLMs). Experiments demonstrate that DeDPO is robust to variations in the synthetic labeling method, achieving performance that matches and occasionally exceeds the theoretical upper bound of models trained on fully human-labeled data. This establishes DeDPO as a scalable solution for human-AI alignment using inexpensive synthetic supervision.

Paper Structure (16 sections, 4 theorems, 34 equations, 7 figures, 8 tables)


Introduction
Related Work
Doubly Robust / Debiased estimators.
Preliminaries
Method
Our DeDPO loss is unbiased
Robustness against incorrect synthetic labels
Synthetic preference labels
Fast convergence of DeDPO under slow convergence of the synthetic preference.
Experiments
Experimental setup
Results
Ablation
Conclusion & Discussion
Proof of Proposition (prop:geometric).
...and 1 more section

Key Result

Proposition 1

The expectation of our proposed debiased loss in Eq. (eq:dr) equals the expectation of the original DPO loss in Eq. (eq:dpo); that is, $L_\text{DeDPO}$ is unbiased.
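To make the idea of an unbiased debiased loss concrete, here is a minimal simulation sketch. It does not reproduce the paper's loss in eq:dr. It assumes a generic doubly robust style estimator: the loss under synthetic labels on the unlabeled pairs, plus a correction (human-label loss minus synthetic-label loss) averaged over the small human-labeled set. The logistic pair loss, the scalar `margin` standing in for the model's preference margin, the simulated annotator that flips labels with probability `flip_prob`, and all function names are hypothetical choices made for illustration.

```python
import math
import random


def pair_loss(margin, label):
    """Logistic preference loss for one pair: -log(sigmoid(+/- margin))."""
    signed = margin if label == 1 else -margin
    return math.log1p(math.exp(-signed))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def debiased_loss(labeled, unlabeled):
    """Synthetic-label loss on unlabeled pairs plus a human-label correction.

    labeled:   (margin, human_label, synthetic_label) triples
    unlabeled: (margin, synthetic_label) pairs
    """
    synthetic_term = mean(pair_loss(m, s) for m, s in unlabeled)
    # The correction cancels, in expectation, the bias of the synthetic term.
    correction = mean(pair_loss(m, h) - pair_loss(m, s) for m, h, s in labeled)
    return synthetic_term + correction


def simulate(seed=0, n_labeled=2000, n_unlabeled=20000, flip_prob=0.3):
    """Return (oracle, naive, debiased) loss estimates on simulated pairs."""
    rng = random.Random(seed)

    def draw():
        margin = rng.gauss(0.0, 1.0)  # assumed stand-in for a model margin
        p_first = 1.0 / (1.0 + math.exp(-2.0 * margin))
        human = 1 if rng.random() < p_first else 0
        # Assumed noisy annotator: flips the human label with prob. flip_prob.
        synthetic = 1 - human if rng.random() < flip_prob else human
        return margin, human, synthetic

    labeled = [draw() for _ in range(n_labeled)]
    hidden = [draw() for _ in range(n_unlabeled)]
    unlabeled = [(m, s) for m, _, s in hidden]

    # Oracle: loss if every unlabeled pair had its (hidden) human label.
    oracle = mean(pair_loss(m, h) for m, h, _ in hidden)
    # Naive: trust the synthetic labels without any correction.
    naive = mean(pair_loss(m, s) for m, s in unlabeled)
    return oracle, naive, debiased_loss(labeled, unlabeled)


if __name__ == "__main__":
    oracle, naive, debiased = simulate()
    print(f"oracle={oracle:.3f} naive={naive:.3f} debiased={debiased:.3f}")
```

In this toy setting, the naive estimate drifts away from the oracle because the synthetic labels are wrong a fixed fraction of the time, while the corrected estimate stays close to it. The argument relies on the labeled and unlabeled pairs being drawn from the same distribution, which the simulation enforces by construction.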

Figures (7)

Figure 1: Our method achieves performance comparable to models trained on high-quality labels, particularly excelling at capturing subtle details within complex prompts. DeDPO successfully renders challenging elements like the astronaut helmet on Abraham Lincoln's statue, the Statue of Liberty on the lunar carriage, and specific styling details in the pharaoh's steampunk attire.
Figure 2: AI-as-a-judge comparison on 200 PartiPrompt prompts (Yu et al., 2022). Gemini 2.5 Flash evaluates each image pair on General Preference, Visual Appeal, and Prompt Alignment. The judge can cast either a win or a tie vote for the image pair.
Figure 3: Ablation on the size of the unlabeled data set. We fix the labeled data size to 1.2K pairs and vary the number of unlabeled pairs over 3K, 8K, 38K and 98K. DeDPO clearly shows superior results across all unlabeled data set sizes.
Figure 4: Comparison of PickScore distributions for different preference-learning strategies with SD1.5 (top) and SDXL (bottom).
Figure 5: Comparison of preference mechanisms between synthetic (Qwen-VLM) and human labels. Qwen-VLM prioritizes semantic coherence and aesthetic adherence, strictly following artistic style constraints. However, it exhibits limited grounding in photorealism tokens, often failing to associate terms like "8k" or "hyper-detailed" with realistic texturing. Conversely, human preferences show a strong bias toward photorealistic fidelity; human evaluators frequently overlook hallucinations or ignored constraints (e.g., missing details, specific styles like "painting") if the resulting image appears realistic.
...and 2 more figures

Theorems & Definitions (7)

Proposition 1
proof
Proposition 2
Theorem 1: Informal
proof
Lemma 1: Strong convexity
proof
