SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model
Zhengang Li, Yan Kang, Yuchen Liu, Difan Liu, Tobias Hinz, Feng Liu, Yanzhi Wang
TL;DR
SNED tackles the heavy computational burden of designing video diffusion models by introducing a superposition neural architecture search (NAS) framework that jointly optimizes the architecture under multiple cost and resolution constraints. It combines a one-shot, weight-sharing supernet with dynamic cost training and a novel superposition training strategy to yield subnets across resolutions from $64\times64$ to $256\times256$ and parameter counts from $640\mathrm{M}$ to $1.6\mathrm{B}$. Experiments on both pixel-space and latent-space diffusion models demonstrate competitive quality (measured by FVD and KVD) with meaningful latency reductions, validating SNED's practicality for scalable video synthesis. This work advances NAS for diffusion-based video generation, enabling flexible deployment across devices with differing compute budgets and performance needs.
Abstract
While AI-generated content has garnered significant attention, achieving photo-realistic video synthesis remains a formidable challenge. Despite the promising advances of diffusion models in video generation quality, their complex architectures and substantial computational demands for both training and inference create a significant gap between these models and real-world applications. This paper presents SNED, a superposition network architecture search method for efficient video diffusion models. Our method employs a supernet training paradigm that targets various model cost and resolution options using a weight-sharing method. Moreover, we propose a supernet training sampling warm-up for fast training optimization. To showcase the flexibility of our method, we conduct experiments involving both pixel-space and latent-space video diffusion models. The results demonstrate that our framework consistently produces comparable results across different model options with high efficiency. In the pixel-space experiments, we achieve consistent video generation results simultaneously across resolutions from 64 x 64 to 256 x 256 and across a large range of model sizes, from 640M to 1.6B parameters.
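To make the weight-sharing idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a smaller subnet reuses the leading output channels of the largest layer and that one (cost, resolution) option is drawn per training step. The names `SharedLinear`, `sample_config`, `WIDTH_RATIOS`, and `RESOLUTIONS`, as well as the specific width ratios, are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical search space: these values are illustrative, not the paper's.
WIDTH_RATIOS = [0.4, 0.6, 0.8, 1.0]   # fraction of channels kept (cost option)
RESOLUTIONS = [64, 128, 256]          # square frame sizes (resolution option)


class SharedLinear:
    """A linear layer whose sub-layers reuse the leading rows of one weight matrix."""

    def __init__(self, in_features, max_out_features, seed=0):
        rng = random.Random(seed)
        # One weight matrix, sized for the largest subnet.
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_features)]
                       for _ in range(max_out_features)]

    def forward(self, x, width_ratio):
        # Keep only the first k output channels. Smaller subnets are slices
        # of the largest one, so every subnet shares the same parameters.
        k = max(1, int(round(len(self.weight) * width_ratio)))
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight[:k]]


def sample_config(rng):
    """Draw one (width ratio, resolution) pair for a training step."""
    return rng.choice(WIDTH_RATIOS), rng.choice(RESOLUTIONS)


layer = SharedLinear(in_features=4, max_out_features=10)
x = [1.0, 2.0, 3.0, 4.0]
rng = random.Random(0)
for step in range(3):
    ratio, res = sample_config(rng)
    out = layer.forward(x, ratio)
    print(f"step {step}: width_ratio={ratio}, resolution={res}, out_channels={len(out)}")
```

In this sketch the resolution is only sampled and reported. A real supernet would also resize the input video to the sampled resolution and backpropagate through the selected slice, so that a single set of weights is trained to serve every option.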
