Discriminator-Free Direct Preference Optimization for Video Diffusion

Haoran Cheng, Qide Dong, Liang Peng, Zhizhou Sha, Weiguo Feng, Jinghui Xie, Zhao Song, Shilei Wen, Xiaofei He, Boxi Wu

TL;DR

This paper makes preference alignment of video diffusion models more practical by introducing a discriminator-free Direct Preference Optimization framework (DF-DPO) that replaces costly comparisons between generated video pairs with real/edited video pairs. Simple editing operations turn real videos into lose cases with unambiguous win/lose signals, so training data can be expanded without limit and no discriminator is needed. The authors prove that training remains effective even when real and model-generated videos follow different distributions, show equivalence to the Bradley–Terry model of human preference, and derive an optimal policy formulation for video generation. Experiments on CogVideoX show better human-preference alignment than supervised fine-tuning baselines and existing methods, and ablations indicate that combining temporal and spatial distortions works best. The result is a scalable route to high-quality, temporally coherent video generation, with practical relevance for large-scale video synthesis and alignment.

Abstract

Direct Preference Optimization (DPO), which aligns models with human preferences through win/lose data pairs, has achieved remarkable success in language and image generation. However, applying DPO to video diffusion models faces critical challenges: (1) Data inefficiency. Generating thousands of videos per DPO iteration incurs prohibitive costs; (2) Evaluation uncertainty. Human annotations suffer from subjective bias, and automated discriminators fail to detect subtle temporal artifacts like flickering or motion incoherence. To address these, we propose a discriminator-free video DPO framework that: (1) Uses original real videos as win cases and their edited versions (e.g., reversed, shuffled, or noise-corrupted clips) as lose cases; (2) Trains video diffusion models to distinguish and avoid artifacts introduced by editing. This approach eliminates the need for costly synthetic video comparisons, provides unambiguous quality signals, and enables unlimited training data expansion through simple editing operations. We theoretically prove the framework's effectiveness even when real videos and model-generated videos follow different distributions. Experiments on CogVideoX demonstrate the efficiency of the proposed method.
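The pair construction described in the abstract can be illustrated with a minimal sketch. It assumes a clip is a list of frames, each frame a flat list of pixel values in [0, 1]; the function names `make_lose_clip` and `make_preference_pair`, the `noise_std` parameter, and the Gaussian noise model are hypothetical choices for illustration, not the authors' implementation.

```python
import random


def make_lose_clip(frames, mode, rng, noise_std=0.1):
    """Return an edited ("lose") copy of a real ("win") clip.

    frames: list of frames, each a flat list of pixel values in [0, 1]
            (an assumed toy representation, not the paper's data format).
    mode:   "reverse", "shuffle", or "noise".
    rng:    a random.Random instance, so edits are reproducible.
    """
    if mode == "reverse":
        # Play the clip backwards.
        return [list(f) for f in reversed(frames)]
    if mode == "shuffle":
        order = list(range(len(frames)))
        # Redraw until the frame order actually changes (needs >= 2 frames).
        while len(frames) > 1 and order == sorted(order):
            rng.shuffle(order)
        return [list(frames[i]) for i in order]
    if mode == "noise":
        # Add Gaussian noise per pixel, then clamp back into [0, 1].
        return [
            [min(1.0, max(0.0, p + rng.gauss(0.0, noise_std))) for p in f]
            for f in frames
        ]
    raise ValueError(f"unknown edit mode: {mode!r}")


def make_preference_pair(frames, mode, rng):
    """Pair the untouched real clip (win) with its edited copy (lose)."""
    return frames, make_lose_clip(frames, mode, rng)


# Usage: a toy 6-frame clip whose frames brighten over time.
rng = random.Random(0)
clip = [[t / 10.0] * 4 for t in range(6)]
win, lose = make_preference_pair(clip, "reverse", rng)
```

Because every real clip yields a lose case under each edit mode, the number of available pairs grows with the number of real videos and edit types, without any video generation step.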

Paper Structure

This paper contains 29 sections, 8 theorems, 32 equations, 4 figures, 1 table, 1 algorithm.

Key Result

Theorem 4.2

If the following conditions hold: Then, we can show that

Figures (4)

  • Figure 1: Comparison between DPO and our proposed framework. Traditional DPO relies on computationally expensive generated video pairs, which suffer from ambiguous quality margins and scalability issues. Our method replaces generated pairs with real/edited video pairs, where edited videos serve as lose cases and the original real videos act as win cases. This approach eliminates generative overhead, provides explicit preference signals, and enables infinite scalability.
  • Figure 2: Qualitative comparison with state-of-the-art models: OpenSora, OpenSora-Plan, and CogVideoX (Yang et al., 2024). The OpenSora cases in the figure exhibit some visual distortion, while the OpenSora-Plan and CogVideoX cases tend to remain static. In comparison, our method performs well in both image quality and dynamic motion quality.
  • Figure 3: Comparison with SFT methods. For the original model, the seat in the left case shows noticeable distortion, while the right case exhibits some blurring. The SFT results alleviate image quality issues but display limited motion range. In contrast, our method maintains high image and motion quality while preserving a reasonable motion amplitude.
  • Figure 4: Comparison with different edit methods. Original outputs exhibit foot distortion and motion discontinuity. Spatial Distortion improves clarity but introduces leg anomalies (frames 2-3), while Temporal Distortion enhances motion smoothness at the cost of blurring. Hybrid implementation resolves these trade-offs, achieving optimal visual-motion quality.

Theorems & Definitions (15)

  • Definition 4.1: State-action function, value function, and advantage function
  • Theorem 4.2: Optimal policy guarantees, informal version of Theorem A.1
  • Definition 4.3: Bradley-Terry model (Bradley and Terry, 1952)
  • Theorem 4.4: Equivalence with Bradley-Terry model, informal version of Theorem A.2
  • Definition 4.5: Video-frame-level direct preference optimization problem
  • Theorem 4.6: Optimal policy for video-DPO problem (informal version)
  • Theorem 4.7: Offset partition function $Z(s_t, \beta)$ (informal version)
  • Theorem A.1: Optimal policy guarantees, formal version of Theorem 4.2
  • proof
  • Theorem A.2: Equivalence with Bradley-Terry model, formal version of Theorem 4.4
  • ...and 5 more
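
For readers unfamiliar with the Bradley-Terry model named in Definition 4.3, the sketch below shows its standard form, in which the probability that a win sample is preferred is the sigmoid of a reward difference, together with the standard single-pair DPO loss built on it. This is the generic formulation from the DPO literature; the paper's video-frame-level version (Definition 4.5) may differ. The function names and the default `beta=0.1` are assumptions made for illustration.

```python
import math


def bt_preference_prob(reward_win, reward_lose):
    """Bradley-Terry probability that the win sample beats the lose sample.

    Equals exp(r_w) / (exp(r_w) + exp(r_l)) = sigmoid(r_w - r_l).
    """
    d = reward_win - reward_lose
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)  # avoids overflow when d is very negative
    return e / (1.0 + e)


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for a single (win, lose) pair.

    The implicit reward of a sample is beta * (log pi - log pi_ref), so the
    loss is -log sigmoid(beta * (win log-ratio - lose log-ratio)).
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Numerically stable -log(sigmoid(margin)) = softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls below that as the policy raises the win clip's likelihood relative to the lose clip's.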