MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training

Haotian Xue, Qi Chen, Zhonghao Wang, Xun Huang, Eli Shechtman, Jinrong Xie, Yongxin Chen

TL;DR

MoGAN addresses the limitation that frame-wise diffusion losses fail to enforce temporal motion quality in video diffusion. It introduces a motion-centric post-training framework that trains a DiT-based optical-flow discriminator on flow sequences, combined with a distribution-matching regularizer, all atop a 3-step distilled video diffusion model to preserve efficiency. The approach yields significant improvements in motion coherence, dynamics, and temporal realism on VBench and VideoJAM-Bench, with human studies favoring MoGAN for motion quality, while maintaining appearance fidelity and inference speed. This work demonstrates that adversarial learning in optical-flow space provides a scalable and effective signal for enhancing temporal dynamics without requiring external rewards or changes to the generation pipeline.

Abstract

Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence, dynamics and realism, often producing jitter, ghosting, or implausible dynamics. A key limitation is that the standard denoising MSE objective provides no direct supervision on temporal consistency, allowing models to achieve low loss while still generating poor motion. We propose MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data. Built atop a 3-step distilled video diffusion model, we train a DiT-based optical-flow discriminator to differentiate real from generated motion, combined with a distribution-matching regularizer to preserve visual fidelity. With experiments on Wan2.1-T2V-1.3B, MoGAN substantially improves motion quality across benchmarks. On VBench, MoGAN boosts motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model. On VideoJAM-Bench, MoGAN improves motion score by +7.4% over the teacher and +8.8% over DMD, while maintaining comparable or even better aesthetic and image-quality scores. A human study further confirms that MoGAN is preferred for motion quality (52% vs. 38% for the teacher; 56% vs. 29% for DMD). Overall, MoGAN delivers significantly more realistic motion without sacrificing visual fidelity or efficiency, offering a practical path toward fast, high-quality video generation. Project webpage is: https://xavihart.github.io/mogan.
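
The abstract pairs an adversarial signal on optical flow with a distribution-matching regularizer. The sketch below shows one way such a combined objective could be written. It assumes a standard non-saturating logistic GAN loss, which the abstract does not specify, so the paper's exact formulation may differ. The function names, the weight `adv_weight`, and the sample logits are hypothetical, and the distribution-matching term is passed in as a plain number rather than computed.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def discriminator_loss(logits_real_flow, logits_fake_flow):
    """Logistic GAN loss: raise logits on real flow, lower them on generated flow."""
    return float(np.mean(softplus(-logits_real_flow)) + np.mean(softplus(logits_fake_flow)))

def generator_adv_loss(logits_fake_flow):
    """Non-saturating generator loss: make generated flow look real to the discriminator."""
    return float(np.mean(softplus(-logits_fake_flow)))

def generator_total_loss(dm_loss, logits_fake_flow, adv_weight=1.0):
    """Distribution-matching term plus a weighted adversarial term (adv_weight is hypothetical)."""
    return dm_loss + adv_weight * generator_adv_loss(logits_fake_flow)

# Stand-in discriminator outputs for a batch of 4 flow sequences (made-up values).
logits_real = np.array([2.0, 1.5, 3.0, 0.5])
logits_fake = np.array([-1.0, -2.0, 0.0, -0.5])

d_loss = discriminator_loss(logits_real, logits_fake)
g_loss = generator_total_loss(dm_loss=0.2, logits_fake_flow=logits_fake, adv_weight=0.5)
```

In an actual training loop the logits would come from the flow discriminator applied to optical flow estimated from real and generated clips, and the two losses would be minimized in alternation.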

Paper Structure

This paper contains 33 sections, 6 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: Lower diffusion loss does not imply better motion. Generated by the same model with different random seeds, the top block achieves a lower diffusion training loss ($\approx 0.36$) but the predicted $\hat{\mathbf{x}}_{0}$ exhibits ghosting, jitter, and incoherent optical flow in the highlighted regions. In contrast, the bottom block has a slightly higher loss ($\approx 0.39$) yet produces smoother, more coherent motion with consistent flow fields. This discrepancy shows that pixelwise diffusion objectives (MSE) systematically under-penalize temporal artifacts and do not adequately optimize motion quality.
  • Figure 2: Pipeline of the Proposed Few-Step Motion-GAN Post-Training. The training loop iteratively optimizes four losses: two distribution-matching losses that regularize the student to remain close to the teacher distribution, and two MoGAN losses that directly improve motion quality. (Left panel): given $t_i\in\{t_1,t_2,t_3\}$ and a condition $c_j$ from the prompt list, the few-step generator $\mathbf{G}_{\theta}$ produces an $x_0$ prediction. The teacher head $\mathbf{v}_{\text{real}}$ is frozen, while the student head $\mathbf{v}_{\text{fake}}$ learns to reflect the distribution modeled by $\mathbf{G}_{\theta}$. The optical-flow-centric discriminator $\mathbf{D}_{\varphi}$ operates on dense optical flows. (Right panel): the DiT-based optical-flow discriminator; see Section \ref{subsecion:design_of_motion_disc} for more details.
  • Figure 3: Qualitative comparison across models: For both prompts, we show video clips alongside the optical-flow visualization for three models: Wan2.1 (50-step), DMD-only (3-step), and our Motion-GAN post-trained model, all sampled with the same seed. Motion artifacts that are sometimes subtle in pixel space become clearly visible in the optical-flow maps and are highlighted with red boxes. More visualizations are in the Appendix.
  • Figure 4: Our Model Improves Smoothness Without Sacrificing Dynamics: Motion-GAN post-training generates more realistic motion by balancing dynamics and smoothness. In both examples, the DMD-distilled model tends to produce overly static videos, while our method generates smoother and more naturally dynamic motion.
  • Figure 5: Results of Human Survey. Side-by-side human preference study comparing our 3-step Motion-GAN with DMD (3-step) and Wan2.1 (50-step).
  • ...and 2 more figures
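
The observation in Figure 1, that a lower pixelwise loss need not mean better motion, can be illustrated with a toy example. This is a minimal sketch under strong assumptions: the "video" is a synthetic moving bar, the error patterns are hand-picked, and frame-to-frame differences stand in for optical flow. It does not reproduce the paper's measurement, and the function names are hypothetical.

```python
import numpy as np

def per_frame_mse(pred, target):
    """Squared error averaged over all frames and pixels."""
    return float(np.mean((pred - target) ** 2))

def temporal_error(pred, target):
    """Squared error between frame-to-frame differences (a crude motion proxy, not optical flow)."""
    return float(np.mean((np.diff(pred, axis=0) - np.diff(target, axis=0)) ** 2))

# Toy target video: T frames of size H x W, a bright bar moving one pixel per frame.
T, H, W = 8, 4, 16
target = np.zeros((T, H, W))
for t in range(T):
    target[t, :, t] = 1.0

# Prediction A: small brightness error whose sign flips every frame (flicker).
signs = np.where(np.arange(T) % 2 == 0, 1.0, -1.0)[:, None, None]
flicker = target + 0.10 * signs

# Prediction B: slightly larger brightness error that is constant over time.
steady = target + 0.12

mse_flicker, mse_steady = per_frame_mse(flicker, target), per_frame_mse(steady, target)
tmp_flicker, tmp_steady = temporal_error(flicker, target), temporal_error(steady, target)
```

Here the flickering prediction has the lower per-frame MSE but the larger temporal error, while the steady prediction has a higher MSE and essentially no temporal error. A loss computed only per frame therefore ranks the two in the opposite order from a motion-aware measure.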