Taming Diffusion Transformer for Efficient Mobile Video Generation in Seconds
Yushu Wu, Yanyu Li, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ke Ma, Arpit Sahni, Ju Hu, Aliaksandr Siarohin, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov
TL;DR
The paper tackles the prohibitive computational cost of diffusion-Transformer video generation on mobile devices by combining a high-compression video VAE, knowledge-distillation-guided tri-level pruning, adversarial step distillation, and a tiled GEMM strategy for FFN inference. The resulting mobile DiT samples in four steps and generates at around 15 FPS on an iPhone 16 Pro Max while preserving strong visual fidelity and temporal coherence. Key contributions include a systematic study of latent compression trade-offs, a KD-guided pruning framework that preserves performance under memory budgets, a discriminator-based adversarial distillation enabling few-step generation, and hardware-aware FFN optimization. The work demonstrates practical, on-device diffusion-based video synthesis on consumer hardware, with implications for real-time, private video generation and edge-AI deployments, while acknowledging limitations related to data, reproducibility, and potential misuse.
Abstract
Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones, which makes on-device generation especially challenging. In this work, we propose a series of novel optimizations to significantly accelerate video generation and enable practical deployment on mobile platforms. First, we employ a highly compressed variational autoencoder (VAE) to reduce the dimensionality of the input data without sacrificing visual quality. Second, we introduce a KD-guided, sensitivity-aware tri-level pruning strategy to shrink the model size to suit mobile platforms while preserving critical performance characteristics. Third, we develop an adversarial step distillation technique tailored for DiT, which allows us to reduce the number of inference steps to four. Combined, these optimizations enable our model to achieve approximately 15 frames per second (FPS) generation speed on an iPhone 16 Pro Max, demonstrating the feasibility of efficient, high-quality video generation on mobile devices.
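The TL;DR mentions a tiled GEMM strategy for FFN inference without giving details. As a rough illustration only, the NumPy sketch below tiles a feed-forward block along the token dimension, so that only one tile of hidden activations is materialized at a time. The tiling axis, the tile size, the GELU activation, and all function and variable names are assumptions made for this example; they are not taken from the paper, whose on-device implementation may differ.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of GELU (assumed activation for this sketch).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def ffn_dense(x, w1, b1, w2, b2):
    # Reference FFN: the full (tokens, d_hidden) activation is materialized at once.
    return gelu(x @ w1 + b1) @ w2 + b2


def ffn_tiled(x, w1, b1, w2, b2, tile_rows=64):
    # Hypothetical tiled variant: process tile_rows tokens per step, so the
    # intermediate activation is at most (tile_rows, d_hidden).
    out = np.empty((x.shape[0], w2.shape[1]), dtype=x.dtype)
    for start in range(0, x.shape[0], tile_rows):
        stop = min(start + tile_rows, x.shape[0])
        hidden = gelu(x[start:stop] @ w1 + b1)
        out[start:stop] = hidden @ w2 + b2
    return out


# Toy sizes chosen for illustration; 200 is deliberately not a multiple of 64.
rng = np.random.default_rng(0)
tokens, d_model, d_hidden = 200, 32, 128
x = rng.standard_normal((tokens, d_model))
w1 = rng.standard_normal((d_model, d_hidden)) * 0.1
b1 = np.zeros(d_hidden)
w2 = rng.standard_normal((d_hidden, d_model)) * 0.1
b2 = np.zeros(d_model)

dense_out = ffn_dense(x, w1, b1, w2, b2)
tiled_out = ffn_tiled(x, w1, b1, w2, b2, tile_rows=64)
outputs_match = np.allclose(dense_out, tiled_out)
```

Because each output row of an FFN depends only on the matching input row, tiling along tokens leaves the result unchanged up to floating-point rounding while bounding the size of the intermediate buffer, which is the kind of trade-off a memory-constrained device cares about.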
