SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model

Zhengang Li, Yan Kang, Yuchen Liu, Difan Liu, Tobias Hinz, Feng Liu, Yanzhi Wang

TL;DR

SNED tackles the heavy computational burden of designing video diffusion models by introducing a superposition NAS framework that jointly optimizes architecture under multiple cost and resolution constraints. It combines a one-shot, weight-sharing supernet with dynamic cost training and a novel superposition training strategy to yield subnets across resolutions from $64\times64$ to $256\times256$ and parameter counts from 640M to $1.6\mathrm{B}$. Experiments on both pixel-space and latent-space diffusion models demonstrate competitive quality (FVD and KVD) with meaningful latency reductions, supporting SNED's practicality for scalable video synthesis. This work advances NAS for diffusion-based video generation, enabling flexible deployment across devices with differing compute budgets and performance needs.

Abstract

While AI-generated content has garnered significant attention, achieving photo-realistic video synthesis remains a formidable challenge. Despite promising advances in the generation quality of video diffusion models, their complex architectures and substantial computational demands for both training and inference create a significant gap between these models and real-world applications. This paper presents SNED, a superposition network architecture search method for efficient video diffusion models. Our method employs a supernet training paradigm that targets various model cost and resolution options using a weight-sharing method. Moreover, we propose a supernet training sampling warm-up for fast training optimization. To showcase the flexibility of our method, we conduct experiments involving both pixel-space and latent-space video diffusion models. The results demonstrate that our framework consistently produces comparable results across different model options with high efficiency. In the pixel-space experiments, we achieve consistent video generation results simultaneously across resolutions from $64\times64$ to $256\times256$, with model sizes ranging from 640M to 1.6B parameters.
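
As a rough illustration of the weight-sharing supernet idea mentioned in the abstract, the toy sketch below samples a resolution and a cost option at each step and updates only the slice of a shared weight matrix that the sampled subnet uses. Everything here is an assumption made for illustration: the names (`SharedLinear`, `sample_config`, `train_step`, `train_supernet`), the option lists, and the toy regression objective are hypothetical and are not taken from the paper, which does not specify its search space or sampling scheme in this section.

```python
import random

# Hypothetical option lists; the values are illustrative, not from the paper.
RESOLUTIONS = [64, 128, 256]
WIDTH_RATIOS = [0.4, 0.6, 0.8, 1.0]  # fraction of supernet channels kept


class SharedLinear:
    """A linear layer whose single weight matrix is shared by all subnets."""

    def __init__(self, max_in, max_out, rng):
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(max_in)]
                       for _ in range(max_out)]

    def forward(self, x, out_dim):
        # A subnet uses only the top-left (out_dim x len(x)) slice.
        return [sum(self.weight[o][i] * x[i] for i in range(len(x)))
                for o in range(out_dim)]


def sample_config(rng):
    """Pick one resolution and one cost option for this training step."""
    return {"resolution": rng.choice(RESOLUTIONS),
            "width_ratio": rng.choice(WIDTH_RATIOS)}


def active_width(max_width, ratio):
    """Number of channels the sampled subnet keeps (at least one)."""
    return max(1, int(round(max_width * ratio)))


def train_step(layer, x, target, out_dim, lr=0.01):
    """One SGD step on squared error; weights outside the slice stay untouched."""
    y = layer.forward(x, out_dim)
    for o in range(out_dim):
        err = y[o] - target[o]
        for i in range(len(x)):
            layer.weight[o][i] -= lr * 2 * err * x[i]
    return sum((y[o] - target[o]) ** 2 for o in range(out_dim))


def train_supernet(steps=100, max_width=8, seed=0):
    rng = random.Random(seed)
    layer = SharedLinear(max_width, max_width, rng)
    for _ in range(steps):
        cfg = sample_config(rng)
        width = active_width(max_width, cfg["width_ratio"])
        # In a real convolutional model, cfg["resolution"] would set the
        # spatial size of the input; this toy uses a per-channel vector.
        x = [rng.uniform(-1, 1) for _ in range(width)]
        target = [0.5 * v for v in x]
        train_step(layer, x, target, out_dim=width)
    return layer
```

Calling `train_supernet()` returns a single layer whose leading slices serve as subnets of different widths, which is the sense in which the weights are shared across cost options.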

Paper Structure (22 sections, 5 figures, 2 tables, 1 algorithm)

Figures (5)

  • Figure 1: Overview of the SNED framework. (a) We train a supernet with network dynamic cost sampling and multiple input resolution options. In each iteration, a subnet of the supernet is sampled for training, and the other parts (grey) are frozen. (b) After training, we obtain subnets with different model costs for each resolution option.
  • Figure 2: Dynamic cost scheme for SNED framework.
  • Figure 3: Results of pixel-space video diffusion model for different resolution options.
  • Figure 4: Results of pixel-space base model subnets with different model sizes. The percentages indicate each subnet's size relative to the supernet. We show the results of each subnet with two different noise seeds.
  • Figure 5: Comparison with LVDM at a resolution of 256$\times$256 on the Sky Time-lapse dataset. We present the first frame of each video. Three subnets with different numbers of parameters are included in the comparison.