Accelerating Video Diffusion Models via Distribution Matching
Yuanzhi Zhu, Hanshu Yan, Huan Yang, Kai Zhang, Junnan Li
TL;DR
This work tackles the computational bottleneck of diffusion-based video generation by introducing AVDM2, a distillation framework that combines adversarial distribution matching (ADM) with 2D score distribution matching (SDM) to train a few-step video generator from a pre-trained teacher. Using a frozen teacher encoder, a diffusion-augmented discriminator, and frame-level SDM guidance, the method distills the teacher's knowledge into a four-step generator that delivers superior frame quality and temporal coherence. With AnimateDiff serving as the teacher, the approach reports strong quantitative gains over baselines on metrics such as FVD and CLIPScore, and it supports flexible style transfer by leveraging 2D diffusion models. The work highlights distribution matching as a powerful tool for efficient video diffusion, while noting limitations around one-step distillation, training overhead, and output diversity that warrant future exploration.
Abstract
Generative models, particularly diffusion models, have achieved significant success in data synthesis across various modalities, including images, videos, and 3D assets. However, current diffusion models are computationally intensive, often requiring numerous sampling steps that limit their practical application, especially in video generation. This work introduces a novel framework for diffusion distillation and distribution matching that dramatically reduces the number of inference steps while maintaining, and potentially improving, generation quality. Our approach focuses on distilling pre-trained diffusion models into a more efficient few-step generator, specifically targeting video generation. By combining a video GAN loss with a novel 2D score distribution matching loss, we demonstrate the potential to generate high-quality video frames with substantially fewer sampling steps. Specifically, the proposed method incorporates a denoising GAN discriminator to distill from the real data and a pre-trained image diffusion model to enhance frame quality and prompt-following capability. Experimental results using AnimateDiff as the teacher model demonstrate the method's effectiveness: it outperforms existing techniques using only four sampling steps.
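To make the two training signals in the abstract concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes a non-saturating form for the generator's GAN loss and uses analytic Gaussian scores as stand-ins for the pre-trained image diffusion model and for a score model fitted to generator outputs. All function names, shapes, and hyperparameters here are hypothetical.

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def generator_gan_loss(fake_logits):
    # Assumed non-saturating generator loss on video-level discriminator
    # logits: -log(sigmoid(logit)) == softplus(-logit).
    return float(np.mean(softplus(-np.asarray(fake_logits, dtype=float))))

def gaussian_score(x, mean, var):
    # Score (gradient of the log-density) of an isotropic Gaussian N(mean, var * I).
    return -(x - mean) / var

def sdm_gradient(frames, sigma, real_mean, fake_mean, rng):
    # frames: (T, D) array; each row is one flattened frame, treated as an
    # independent image because the guiding score models are 2D.
    noisy = frames + sigma * rng.standard_normal(frames.shape)
    var = 1.0 + sigma ** 2  # unit-variance toy data diffused to noise level sigma
    s_real = gaussian_score(noisy, real_mean, var)  # stand-in: pre-trained image model
    s_fake = gaussian_score(noisy, fake_mean, var)  # stand-in: score of generator outputs
    # Score difference: the KL gradient direction with respect to each frame.
    return s_fake - s_real

rng = np.random.default_rng(0)
frames = rng.standard_normal((4, 8))  # 4 toy frames of 8 "pixels", generator mean 0
grad = sdm_gradient(frames, sigma=0.5, real_mean=1.0, fake_mean=0.0, rng=rng)
updated = frames - 0.1 * grad         # one descent step applied directly to the frames
gan_term = generator_gan_loss([0.0, 0.0])  # discriminator undecided on two videos
```

In this toy setting the score difference is the same for every pixel, so the descent step shifts each frame uniformly toward the "real" mean. In an actual training loop the gradient would be backpropagated through the few-step generator and combined with the GAN term under some weighting, which this sketch leaves unspecified.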
