Accelerating Video Diffusion Models via Distribution Matching

Yuanzhi Zhu, Hanshu Yan, Huan Yang, Kai Zhang, Junnan Li

TL;DR

This work tackles the computational bottleneck of diffusion-based video generation by introducing AVDM2, a distillation framework that combines adversarial distribution matching (ADM) with 2D score distribution matching (SDM) to train a few-step video generator from a pre-trained teacher. By freezing a teacher encoder and employing a diffusion-augmented discriminator along with frame-level SDM guidance, the method distills the knowledge into a four-step generator that delivers superior frame quality and temporal coherence. The approach demonstrates strong quantitative gains (e.g., FVD and CLIPScore) over baselines and supports flexible style transfer by leveraging 2D diffusion models, with AnimateDiff serving as the teacher. The work highlights distribution matching as a powerful tool for efficient video diffusion, albeit with limitations around one-step distillation, training overhead, and output diversity that warrant future exploration.

Abstract

Generative models, particularly diffusion models, have achieved significant success in data synthesis across various modalities, including images, videos, and 3D assets. However, current diffusion models are computationally intensive, often requiring numerous sampling steps that limit their practical application, especially in video generation. This work introduces a novel framework for diffusion distillation and distribution matching that dramatically reduces the number of inference steps while maintaining, and potentially improving, generation quality. Our approach focuses on distilling pre-trained diffusion models into a more efficient few-step generator, specifically targeting video generation. By combining a video GAN loss with a novel 2D score distribution matching loss, we demonstrate the potential to generate high-quality video frames with substantially fewer sampling steps. Specifically, the proposed method incorporates a denoising GAN discriminator to distill from real data and a pre-trained image diffusion model to enhance frame quality and prompt-following capabilities. Experimental results using AnimateDiff as the teacher model showcase the method's effectiveness, outperforming existing techniques with just four sampling steps.

Paper Structure

This paper contains 13 sections, 9 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Illustration of proposed distribution matching loss: The generator produces a video from random noise input. This generated video undergoes forward diffusion to create a noisy video, which is then input to the discriminator for GAN loss computation. Simultaneously, $K$ random frames from the same video are diffused with noise and fed into the 2D teacher and fake model to construct the SDM loss. The discriminator is trained to classify ground truth (GT) videos and generated videos, while the 2D fake model is trained with diffusion loss to learn the generated data's diffusion distribution or score. VAE encoder and decoder are omitted in this figure for simplicity.
  • Figure 2: Comparison between our method and teacher AnimateDiff model with different sampling steps. We display the 1st, 8th and last frame.
  • Figure 3: Qualitative comparison on base model AnimateDiff. From top to bottom the text prompts are: 1) a dog with big expressive eyes running in a city park; 2) a majestic horse with a long flowing tail running at a tranquil beach; 3) a red car, moving on the road, mountain, green grass and trees; 4) Origami dancers in white paper, 3D render, ultra-detailed, on white background, studio shot, dancing modern dance.
  • Figure 4: Comparison between our method and training with the diffusion GAN loss alone, on 4-step generation.
  • Figure 5: Visual results of our method with different 2D SDM models.
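
The data flow described in the Figure 1 caption can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the forward diffusion uses the standard DDPM form, the frames are drawn without replacement, and the SDM direction is taken to be the difference between the fake and teacher denoised predictions (a DMD-style choice that the text here does not spell out). The function names, tensor shapes, `alpha_bar` value, and the two toy denoisers are all hypothetical.

```python
import numpy as np

def forward_diffuse(x0, alpha_bar, rng):
    """Standard DDPM-style forward diffusion: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps

def sample_frames(video, k, rng):
    """Pick K random frames from a (frames, C, H, W) video.

    Assumption: frames are drawn without replacement.
    """
    idx = np.sort(rng.choice(video.shape[0], size=k, replace=False))
    return video[idx], idx

def sdm_direction(noisy_frames, teacher_denoise, fake_denoise):
    """Assumed DMD-style update direction: fake prediction minus teacher prediction."""
    return fake_denoise(noisy_frames) - teacher_denoise(noisy_frames)

rng = np.random.default_rng(0)

# Stand-in for a generated latent video: 16 frames, 4 channels, 8x8 latents.
video = rng.standard_normal((16, 4, 8, 8))

# Branch 1: the whole video is diffused and would be passed to the discriminator.
noisy_video, _ = forward_diffuse(video, alpha_bar=0.5, rng=rng)

# Branch 2: K frames are diffused and passed to the 2D teacher and fake models.
frames, idx = sample_frames(video, k=4, rng=rng)
noisy_frames, _ = forward_diffuse(frames, alpha_bar=0.5, rng=rng)

# Hypothetical toy denoisers standing in for the 2D teacher and fake models.
teacher = lambda x: 0.9 * x
fake = lambda x: 0.8 * x
direction = sdm_direction(noisy_frames, teacher, fake)
```

In an actual training loop, `direction` would act as the gradient signal applied to the generator's frames, while the discriminator and the fake model would be updated with their own GAN and diffusion losses, as the caption describes.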