DeDPO: Debiased Direct Preference Optimization for Diffusion Models
Khiem Pham, Quang Nguyen, Tung Nguyen, Jingsen Zhu, Michele Santacatterina, Dimitris Metaxas, Ramin Zabih
TL;DR
DeDPO tackles the scalability bottleneck of preference-based alignment by integrating a debiased, doubly robust estimator into Direct Preference Optimization for diffusion models. It leverages a small set of human preferences together with a large pool of inexpensive synthetic feedback from pretrained AI annotators, while preserving fully offline training. Theoretical guarantees show unbiasedness and favorable convergence that tolerates lower-quality synthetic labels, and experiments demonstrate competitive or superior alignment compared to fully human-labeled baselines across backbones and synthetic sources. This offers a scalable, robust path to human-AI alignment using inexpensive synthetic supervision.
Abstract
Direct Preference Optimization (DPO) has emerged as a predominant alignment method for diffusion models, facilitating off-policy training without explicit reward modeling. However, its reliance on large-scale, high-quality human preference labels presents a severe cost and scalability bottleneck. To overcome this, we propose a semi-supervised framework that augments limited human data with a large corpus of unlabeled pairs annotated via cost-effective synthetic AI feedback. Our paper introduces Debiased DPO (DeDPO), which integrates a debiased estimation technique from causal inference into the DPO objective. By explicitly identifying and correcting the systematic bias and noise inherent in synthetic annotators, DeDPO ensures robust learning from imperfect feedback sources, including self-training and Vision-Language Models (VLMs). Experiments demonstrate that DeDPO is robust to variations in synthetic labeling methods, achieving performance that matches and occasionally exceeds the theoretical upper bound of models trained on fully human-labeled data. This establishes DeDPO as a scalable solution for human-AI alignment using inexpensive synthetic supervision.
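
To make the idea of debiasing synthetic feedback concrete, the following is a minimal sketch of a generic bias-corrected preference loss, not the exact DeDPO objective, which the abstract does not specify. It assumes a standard logistic DPO-style loss on a scalar margin (the scaled implicit reward difference between the two images in a pair) and labels in {+1, -1}. The function names `dpo_loss` and `debiased_loss` and all argument names are hypothetical. The synthetic-label loss on the large pool is corrected by the average gap between human-label and synthetic-label losses on the small human-annotated set.

```python
import numpy as np

def dpo_loss(margin, label):
    """Per-pair logistic preference loss, log(1 + exp(-label * margin)).

    margin: scaled implicit reward difference (first item minus second item).
    label:  +1 if the first item is preferred, -1 otherwise.
    """
    return np.logaddexp(0.0, -label * margin)

def debiased_loss(margin_h, y_human, y_synth_h, margin_u, y_synth_u):
    """Synthetic-label loss on the large pool plus a human-based bias correction.

    margin_h, y_human, y_synth_h: margins, human labels, and synthetic labels
        on the small human-annotated set (illustrative names).
    margin_u, y_synth_u: margins and synthetic labels on the large pool.
    """
    # Cheap but possibly biased term from synthetic labels.
    synth_term = dpo_loss(margin_u, y_synth_u).mean()
    # Average gap between human and synthetic losses on the labeled pairs.
    correction = (dpo_loss(margin_h, y_human) - dpo_loss(margin_h, y_synth_h)).mean()
    return synth_term + correction

# Toy example: the synthetic annotator disagrees with humans on two of three pairs.
margin_h = np.array([1.0, -0.5, 2.0])
y_human = np.array([1, -1, 1])
y_synth_h = np.array([1, 1, -1])
rng = np.random.default_rng(0)
margin_u = rng.normal(size=100)
y_synth_u = rng.choice([-1, 1], size=100)
loss = debiased_loss(margin_h, y_human, y_synth_h, margin_u, y_synth_u)
```

In this sketch the correction term vanishes when the synthetic annotator agrees with the human labels on the labeled set, and when the two sets coincide the estimate reduces to the human-label loss, which is the sense in which the synthetic bias is removed.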
