Discriminator-Free Direct Preference Optimization for Video Diffusion
Haoran Cheng, Qide Dong, Liang Peng, Zhizhou Sha, Weiguo Feng, Jinghui Xie, Zhao Song, Shilei Wen, Xiaofei He, Boxi Wu
TL;DR
This paper makes preference alignment of video diffusion models more practical by introducing a discriminator-free Direct Preference Optimization framework (DF-DPO) that replaces costly comparisons between generated videos with pairs of real videos and their edited versions. Simple editing operations produce unambiguous win/lose signals, so the training data can be expanded without limit and no discriminator is required. The authors prove that training remains effective even when real and model-generated videos follow different distributions, show equivalence to the Bradley–Terry model of human preference, and derive an optimal policy formulation for video generation. On CogVideoX, DF-DPO improves human-preference alignment over supervised fine-tuning baselines and existing methods, and ablations show the benefit of combining temporal and spatial distortions. The result is a scalable route to high-quality, temporally coherent video generation and to alignment of large-scale video synthesis models.
Abstract
Direct Preference Optimization (DPO), which aligns models with human preferences through win/lose data pairs, has achieved remarkable success in language and image generation. However, applying DPO to video diffusion models faces two critical challenges: (1) Data inefficiency. Generating thousands of videos per DPO iteration incurs prohibitive costs; (2) Evaluation uncertainty. Human annotations suffer from subjective bias, and automated discriminators fail to detect subtle temporal artifacts such as flickering or motion incoherence. To address these, we propose a discriminator-free video DPO framework that: (1) Uses original real videos as win cases and their edited versions (e.g., reversed, shuffled, or noise-corrupted clips) as lose cases; (2) Trains video diffusion models to distinguish and avoid the artifacts introduced by editing. This approach eliminates the need for costly synthetic video comparisons, provides unambiguous quality signals, and enables unlimited training data expansion through simple editing operations. We theoretically prove the framework's effectiveness even when real videos and model-generated videos follow different distributions. Experiments on CogVideoX demonstrate the efficiency of the proposed method.
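The win/lose construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a clip is a NumPy array of shape (frames, height, width, channels) with values in [0, 1], and the function names, the `sigma` parameter, and the Gaussian form of the noise are hypothetical choices made for illustration.

```python
import numpy as np


def reverse_clip(video: np.ndarray) -> np.ndarray:
    """Temporal edit: play the clip backwards."""
    return video[::-1].copy()


def shuffle_clip(video: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Temporal edit: reorder frames at random.

    For clips with at least two frames, the result never keeps the
    original frame order.
    """
    num_frames = video.shape[0]
    perm = rng.permutation(num_frames)
    if np.array_equal(perm, np.arange(num_frames)):
        perm = np.roll(perm, 1)  # the draw was the identity; force a change
    return video[perm]


def add_noise(video: np.ndarray, rng: np.random.Generator,
              sigma: float = 0.1) -> np.ndarray:
    """Spatial edit: add Gaussian noise (an assumed noise model) and clip to [0, 1]."""
    noisy = video + rng.normal(0.0, sigma, size=video.shape)
    return np.clip(noisy, 0.0, 1.0).astype(video.dtype)


def make_preference_pair(video: np.ndarray, edit: str,
                         rng: np.random.Generator, sigma: float = 0.1):
    """Return (win, lose): the real clip and an edited copy of it."""
    if edit == "reverse":
        lose = reverse_clip(video)
    elif edit == "shuffle":
        lose = shuffle_clip(video, rng)
    elif edit == "noise":
        lose = add_noise(video, rng, sigma)
    else:
        raise ValueError(f"unknown edit: {edit!r}")
    return video, lose


# Usage: a random array stands in for a real 8-frame clip.
rng = np.random.default_rng(0)
clip = rng.random((8, 4, 4, 3))
win, lose = make_preference_pair(clip, "shuffle", rng)
```

Because every real clip yields a lose case for each edit type, the number of pairs grows with the number of edits applied, and no generated video or discriminator score is needed to label them. The DPO loss that would consume these pairs is not shown here.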
