AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset

Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, Yu Qiao

TL;DR

AccVideo tackles the slow inference of video diffusion models by analyzing and mitigating the useless data points that dataset and Gaussian-noise mismatches introduce into distillation. It constructs SynVid, a 110K-trajectory synthetic dataset of high-quality videos and their denoising paths, and trains a few-step student model via trajectory-based few-step guidance, reducing the number of inference steps by about an order of magnitude. An adversarial training strategy uses the per-timestep data distributions captured by the dataset to align the student’s outputs with the synthetic data, improving video quality without complex regularization. Empirically, AccVideo generates videos up to 8.5× faster than the teacher while maintaining comparable quality, and produces 5-second, 720×1280, 24fps videos, surpassing prior acceleration methods in both speed and visual fidelity.

Abstract

Diffusion models have achieved remarkable progress in the field of video generation. However, their iterative denoising nature requires a large number of inference steps to generate a video, which is slow and computationally expensive. In this paper, we begin with a detailed analysis of the challenges present in existing diffusion distillation methods and propose a novel efficient method, namely AccVideo, which reduces the number of inference steps to accelerate video diffusion models using a synthetic dataset. We leverage the pretrained video diffusion model to generate multiple valid denoising trajectories as our synthetic dataset, which eliminates the use of useless data points during distillation. Based on the synthetic dataset, we design a trajectory-based few-step guidance that utilizes key data points from the denoising trajectories to learn the noise-to-video mapping, enabling video generation in fewer steps. Furthermore, since the synthetic dataset captures the data distribution at each diffusion timestep, we introduce an adversarial training strategy to align the output distribution of the student model with that of our synthetic dataset, thereby enhancing the video quality. Extensive experiments demonstrate that our model achieves an 8.5x improvement in generation speed compared to the teacher model while maintaining comparable performance. Compared to previous acceleration methods, our approach is capable of generating videos with higher quality and resolution, i.e., 5-second, 720x1280, 24fps.
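
The idea of distilling from key data points on a teacher's denoising trajectory can be illustrated in one dimension. The sketch below is not the authors' implementation: it assumes a toy data distribution of two points {-1, +1}, a flow-matching path x_t = (1 - t) * data + t * noise (so t = 1 is pure noise), the closed-form velocity field that this toy setting admits, a 50-step Euler teacher, and evenly spaced key points for a 5-step student. The names `teacher_velocity`, `denoising_trajectory`, and `key_points` are hypothetical.

```python
import math


def teacher_velocity(x: float, t: float) -> float:
    """Closed-form marginal flow-matching velocity for toy data {-1, +1}.

    Assumption: the path is x_t = (1 - t) * data + t * noise with
    standard Gaussian noise, so t = 1 is pure noise and t = 0 is data.
    """
    # Posterior mean of the data point given the noisy sample x at time t.
    posterior_mean = math.tanh(x * (1.0 - t) / (t * t))
    return (x - posterior_mean) / t


def denoising_trajectory(noise: float, num_steps: int = 50) -> list[float]:
    """Euler-integrate the teacher ODE from t = 1 (noise) down to t = 0."""
    dt = 1.0 / num_steps
    x = noise
    path = [x]
    for i in range(num_steps):
        t = 1.0 - i * dt  # current time; never reaches 0 inside the loop
        x = x - dt * teacher_velocity(x, t)
        path.append(x)
    return path


def key_points(path: list[float], student_steps: int = 5) -> list[float]:
    """Keep student_steps + 1 evenly spaced points, endpoints included.

    Assumption: the number of teacher steps is divisible by student_steps.
    """
    stride = (len(path) - 1) // student_steps
    return path[::stride]


if __name__ == "__main__":
    trajectory = denoising_trajectory(noise=0.7)
    # Each consecutive pair of key points is one regression target for a
    # hypothetical few-step student: map key point k to key point k + 1.
    for source, target in zip(key_points(trajectory), key_points(trajectory)[1:]):
        print(f"{source:+.4f} -> {target:+.4f}")
```

Because every stored point lies on a trajectory the teacher actually produced from Gaussian noise, a student trained on such pairs never sees inputs off the teacher's denoising paths, which is the motivation the abstract gives for a synthetic dataset.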

Paper Structure

This paper contains 17 sections, 8 equations, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 1: Video diffusion models can generate high-quality videos, but they require dozens of inference steps, resulting in a slow generation process. For instance, HunyuanVideo [kong2024hunyuanvideo] takes 3234 seconds to generate a 5-second, 720$\times$1280, 24fps video on a single A100 GPU. In contrast, our method accelerates video diffusion models through distillation, achieving an 8.5$\times$ improvement in generation speed while maintaining comparable quality.
  • Figure 2: 1D Toy Experiment. We employ the Flow Matching objective [lipman2022flow] to train the teacher model, which learns the ODE that maps the Gaussian distribution to the data distribution. The data distribution consists of two data points. a) illustrates knowledge distillation methods, where a student model is trained to mimic the teacher model's denoising process. b) highlights the challenges posed by dataset or Gaussian noise mismatching in knowledge distillation, which can lead to unreliable guidance. c) demonstrates distribution matching methods, which aim to align the output distribution of the student model with that of the teacher model. d) emphasizes the issue in distribution matching, which can result in inaccurate guidance. e) illustrates the frequency of useless data points in relation to $M$. f) shows the distillation results at various values of $M$.
  • Figure 3: Method Overview. (a) Our method first designs a trajectory-based few-step guidance, which utilizes the key data points from the denoising trajectory to enable the student model to mimic the denoising process of the pretrained video diffusion model with fewer steps. (b) To fully exploit the data distribution at each diffusion timestep captured by our synthetic dataset, we propose an adversarial training strategy to align the output distribution of the student model with that captured by our synthetic dataset.
  • Figure 4: The pipeline of constructing SynVid.
  • Figure 5: The illustration of features at different layers and diffusion timesteps of our feature extractor. The features within the red box are selected for discrimination.
  • ...and 7 more figures