Mobile Video Diffusion
Haitam Ben Yahia, Denis Korzhenkov, Ioannis Lelekas, Amir Ghodrati, Amirhossein Habibian
TL;DR
MobileVD addresses the high computational demands of video diffusion by engineering a mobile-ready spatio-temporal UNet derived from Stable Video Diffusion (SVD). The approach combines reduced frame resolution, multi-scale temporal representations, channel funnels with CSI initialization, temporal block pruning, and adversarial finetuning that reduces denoising to a single step. The resulting model is reported to be 523× more efficient than SVD and generates latents for a 14-frame 512×256 px clip in about 1.7 s on a Xiaomi-14 Pro, while FVD rises modestly from 149 (SVD) to 171 (MobileVD). This work makes video diffusion practical on consumer devices and points toward higher-resolution mobile video generation through further compression and more efficient autoencoders.
Abstract
Video diffusion models have achieved impressive realism and controllability but are limited by high computational demands, restricting their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. Starting from the spatio-temporal UNet of Stable Video Diffusion (SVD), we reduce memory and computational cost by lowering the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schemes that reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce the denoising to a single step. Our model, coined MobileVD, is 523× more efficient (1817.2 vs. 4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents for a 14×512×256 px clip in 1.7 seconds on a Xiaomi-14 Pro. Our results are available at https://qualcomm-ai-research.github.io/mobile-video-diffusion/
