DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, Aliaksandr Siarohin

TL;DR

DenseDPO tackles motion bias in direct preference optimization for video diffusion by pairing guided, structurally similar videos and by collecting dense segment-level preferences. It combines a per-segment DPO objective with temporally aligned sampling to provide rich supervision, and demonstrates that short-segment VLM labels (e.g., GPT-o3 Segment) can approach human-label performance in DPO. Experiments show DenseDPO delivers markedly higher dynamic degree while preserving visual quality and text alignment, with about one-third the human labeling effort. The approach improves data efficiency, enables automatic labeling, and maintains compatibility with existing diffusion training paradigms, making motion-rich video generation more practical in real-world settings.

Abstract

Direct Preference Optimization (DPO) has recently been applied as a post-training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from independent noise. However, this approach prohibits fine-grained comparisons, and we point out that it biases the annotators towards low-motion clips as they often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. This results in aligned pairs with similar motion structures while differing in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal. With only one-third of the labeled data, DenseDPO greatly improves motion generation over vanilla DPO, while matching it in text alignment, visual quality, and temporal consistency. Finally, we show that DenseDPO unlocks automatic preference annotation using off-the-shelf Vision Language Models (VLMs): GPT accurately predicts segment-level preferences similar to task-specifically fine-tuned video reward models, and DenseDPO trained on these labels achieves performance close to using human labels.
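
The segment-level objective described in the abstract can be illustrated with a minimal sketch. It assumes a Diffusion-DPO-style loss applied independently to each temporal segment, with ties skipped; the function names, the label convention (+1, -1, 0), and the scalar `beta` are illustrative assumptions, not the authors' implementation. The inputs are per-segment denoising errors that a real pipeline would compute from the policy and reference models.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dense_dpo_loss(err_a_policy, err_a_ref, err_b_policy, err_b_ref,
                   labels, beta=1.0):
    """Average a DPO-style loss over labeled segments (hypothetical sketch).

    err_*: per-segment denoising errors for videos A and B under the
           policy and the frozen reference model (assumed precomputed).
    labels: +1 if segment of A is preferred, -1 if B is preferred,
            0 for a tie (the segment is skipped).
    """
    total, count = 0.0, 0
    for ap, ar, bp, br, y in zip(err_a_policy, err_a_ref,
                                 err_b_policy, err_b_ref, labels):
        if y == 0:
            continue  # no preference on this segment, so no gradient signal
        # How much more the policy improved on A than on B, relative to
        # the reference (negative means A improved more).
        margin = (ap - ar) - (bp - br)
        total += -log_sigmoid(-beta * y * margin)
        count += 1
    return total / count if count else 0.0


# Three segments: A preferred, tie, B preferred.
loss = dense_dpo_loss(
    err_a_policy=[0.5, 1.0, 1.2],
    err_a_ref=[1.0, 1.0, 1.0],
    err_b_policy=[1.0, 1.0, 0.6],
    err_b_ref=[1.0, 1.0, 1.0],
    labels=[1, 0, -1],
)
print(loss)
```

Because each segment carries its own label, a pair in which A wins early and B wins late still contributes two training signals, whereas a single clip-level label would have to discard one of them.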

Paper Structure

This paper contains 29 sections, 15 equations, 11 figures, 8 tables, 2 algorithms.

Figures (11)

  • Figure 1: Text-to-video results with our DenseDPO aligned model. Our method improves both visual quality and temporal consistency of the model, enabling generation of challenging motion.
  • Figure 2: Comparison between VanillaDPO (top) and DenseDPO (bottom). VanillaDPO compares two videos generated from independent random noise and assigns only a single binary preference, biasing the annotators toward slow-motion videos. In contrast, DenseDPO generates structurally similar videos from partially noised real videos and labels dense, segment-level preferences.
  • Figure 3: Guided video generation with different $\eta$. Lower $\eta$ means more guidance. We sample one frame per video for visualization. $\eta=0.75$ is enough to maintain the motion trajectory and high-level semantics of the ground-truth video. For slow-motion videos (top), a high $\eta$ suffices to generate artifact-free videos, while videos with challenging motion (bottom) require more guidance.
  • Figure 4: Qualitative results. The pre-trained model generates deformed limbs. VanillaDPO fixes this but generates almost static motion. StructuralDPO retains dynamics but produces oversaturated frames. DenseDPO is the only method that generates correct limbs, large dynamics, and high-quality visuals. Please see our project page (https://snap-research.github.io/DenseDPO/) for video results of the baselines and our methods.
  • Figure 5: Human evaluation of DenseDPO vs. StructuralDPO (left) and VanillaDPO (right). TA, VQ, TC, DD stand for text alignment, visual quality, temporal consistency, and dynamic degree.
  • ...and 6 more figures
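
The guided pair construction in Figures 2 and 3 (partially noising a real video, with lower $\eta$ meaning more guidance) can be sketched as follows. This is a minimal illustration that assumes a linear interpolation between the clean video and Gaussian noise; the paper's actual noise schedule may differ, and `partially_noise` and `make_corrupted_pair` are hypothetical names. The subsequent denoising step, which would require the video diffusion model, is omitted.

```python
import random


def partially_noise(frames, eta, seed):
    """Blend each pixel value with Gaussian noise (assumed linear schedule).

    frames: list of frames, each a flat list of floats.
    eta: noise level in [0, 1]; 0 returns the video unchanged,
         1 returns pure noise.
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    rng = random.Random(seed)
    return [[(1.0 - eta) * v + eta * rng.gauss(0.0, 1.0) for v in frame]
            for frame in frames]


def make_corrupted_pair(frames, eta, seed_a=0, seed_b=1):
    # Two corrupted copies of the same real video: they share its coarse
    # structure but differ in the injected noise, so the denoised results
    # would differ in local details.
    return (partially_noise(frames, eta, seed_a),
            partially_noise(frames, eta, seed_b))


video = [[0.0, 1.0], [2.0, 3.0]]  # toy 2-frame, 2-pixel "video"
copy_a, copy_b = make_corrupted_pair(video, eta=0.75)
print(copy_a != copy_b)
```

Both copies keep a $(1-\eta)$ share of the original signal, which is why the pair stays temporally aligned and can be compared segment by segment.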