MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training
Haotian Xue, Qi Chen, Zhonghao Wang, Xun Huang, Eli Shechtman, Jinrong Xie, Yongxin Chen
TL;DR
Frame-wise diffusion losses do not directly enforce temporal motion quality in video diffusion models. MoGAN is a motion-centric post-training framework that trains a DiT-based optical-flow discriminator on flow sequences and combines it with a distribution-matching regularizer, all atop a 3-step distilled video diffusion model to preserve efficiency. The approach improves motion coherence, dynamics, and temporal realism on VBench and VideoJAM-Bench (for example, +7.3% VBench motion score over the 50-step teacher), and human studies favor MoGAN for motion quality, while appearance fidelity and inference speed are maintained. This work demonstrates that adversarial learning in optical-flow space provides a scalable and effective signal for enhancing temporal dynamics without requiring external rewards or changes to the generation pipeline.
Abstract
Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence, dynamics, and realism, often producing jitter, ghosting, or implausible dynamics. A key limitation is that the standard denoising MSE objective provides no direct supervision on temporal consistency, allowing models to achieve low loss while still generating poor motion. We propose MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data. Starting from a 3-step distilled video diffusion model, we train a DiT-based optical-flow discriminator to differentiate real from generated motion, and combine it with a distribution-matching regularizer to preserve visual fidelity. In experiments on Wan2.1-T2V-1.3B, MoGAN substantially improves motion quality across benchmarks. On VBench, MoGAN boosts motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model. On VideoJAM-Bench, MoGAN improves motion score by +7.4% over the teacher and +8.8% over DMD, while maintaining comparable or even better aesthetic and image-quality scores. A human study further confirms that MoGAN is preferred for motion quality (52% vs. 38% for the teacher; 56% vs. 29% for DMD). Overall, MoGAN delivers significantly more realistic motion without sacrificing visual fidelity or efficiency, offering a practical path toward fast, high-quality video generation. Project webpage: https://xavihart.github.io/mogan.
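To make the structure of the objective concrete, the following is a minimal sketch of how an adversarial term on optical-flow discriminator outputs might be combined with a distribution-matching regularizer. It is an illustration under stated assumptions, not the paper's implementation: the logistic (non-saturating) GAN loss, the weight `adv_weight`, the scalar `dm_loss`, and all function names are hypothetical. The logits stand in for the outputs of a flow discriminator applied to real and generated flow sequences.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def discriminator_loss(real_logits, fake_logits):
    # Assumed logistic GAN loss: raise logits on real flow, lower them on generated flow.
    return softplus(-real_logits).mean() + softplus(fake_logits).mean()

def generator_loss(fake_logits, dm_loss, adv_weight=1.0):
    # Assumed combination: distribution-matching regularizer (a precomputed scalar here)
    # plus a weighted non-saturating adversarial term on generated flow.
    adv = softplus(-fake_logits).mean()
    return dm_loss + adv_weight * adv

# Stand-in discriminator outputs; a real setup would compute these from
# optical flow estimated on real and generated videos.
real_logits = np.array([2.0, 1.5, 3.0])
fake_logits = np.array([-1.0, 0.5, -2.0])

d_loss = discriminator_loss(real_logits, fake_logits)
g_loss = generator_loss(fake_logits, dm_loss=0.2, adv_weight=0.5)
print(f"discriminator loss: {d_loss:.4f}, generator loss: {g_loss:.4f}")
```

In this sketch the adversarial term shrinks as the discriminator rates generated motion as more realistic, while the regularizer term is left untouched, mirroring the stated division of labor between motion realism and visual fidelity.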
