Taming Diffusion Transformer for Efficient Mobile Video Generation in Seconds

Yushu Wu, Yanyu Li, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ke Ma, Arpit Sahni, Ju Hu, Aliaksandr Siarohin, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov

TL;DR

The paper tackles the prohibitive computational cost of diffusion-Transformer video generation on mobile devices by combining a high-compression video VAE, knowledge-distillation-guided tri-level pruning, adversarial step distillation, and a tiled GEMM strategy for FFN inference. The result is a mobile DiT capable of four-step sampling at around 15 FPS on an iPhone 16 Pro Max, while preserving strong visual fidelity and temporal coherence. Key contributions include a systematic study of latent compression trade-offs, a KD-guided pruning framework that preserves performance under memory budgets, a discriminator-based adversarial distillation enabling few-step generation, and hardware-aware FFN optimization. The work demonstrates practical, on-device diffusion-based video synthesis on consumer hardware, with implications for real-time, private video generation and edge-AI deployments, while acknowledging limitations related to data, reproducibility, and potential misuse.

Abstract

Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones, which makes practical on-device generation especially challenging. In this work, we propose a series of novel optimizations to significantly accelerate video generation and enable practical deployment on mobile platforms. First, we employ a highly compressed variational autoencoder (VAE) to reduce the dimensionality of the input data without sacrificing visual quality. Second, we introduce a KD-guided, sensitivity-aware tri-level pruning strategy to shrink the model size to suit mobile platforms while preserving critical performance characteristics. Third, we develop an adversarial step distillation technique tailored for DiT, which allows us to reduce the number of inference steps to four. Combined, these optimizations enable our model to achieve approximately 15 frames per second (FPS) generation speed on an iPhone 16 Pro Max, demonstrating the feasibility of efficient, high-quality video generation on mobile devices.

Paper Structure

This paper contains 35 sections, 10 equations, 9 figures, 13 tables, 1 algorithm.

Figures (9)

  • Figure 1: Videos generated by our efficient Diffusion Transformer.
  • Figure 2: Overview of the proposed KD-Guided Tri-Level Pruning and the new discriminator head. The tri-level pruning scheme operates at three levels of granularity, from coarse to fine: the block, the attention head, and the feed-forward network dimension. This design enables flexible, efficient, and stable model compression. Additionally, the proposed discriminator adopts standard DiT blocks with an MLP classifier head, improving condition alignment for adversarial training.
  • Figure 3: Sensitivity Analysis of DiT Components. The sensitivity analysis is conducted by progressively pruning DiT blocks, attention heads, and the feed-forward network (FFN) dimension. For each setting, we benchmark FLOPs, memory usage, inference speed, and VBench score to assess the impact of each component on model efficiency and performance.
  • Figure 4: Video generated by our efficient diffusion transformer.
  • Figure 5: Illustration of tiled GEMM for a single token. The input $X$ and weights $W$ are both tiled into $k$ partitions along the input feature dimension.
  • ...and 4 more figures
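
The tiling described in the Figure 5 caption can be illustrated with a small sketch. It relies only on the identity $XW = \sum_{i=1}^{k} X_i W_i$, where $X_i$ and $W_i$ are the $i$-th partitions of the input and the weights along the input feature dimension. The function names `tiled_matvec` and `dense_matvec` are hypothetical, the code assumes equal-sized tiles, and it is written in plain Python for readability; it is not the paper's on-device kernel, whose details are not given in this section.

```python
def dense_matvec(x, W):
    """Reference: y = x @ W for one token x (length d_in) and W (d_in x d_out)."""
    d_in, d_out = len(x), len(W[0])
    return [sum(x[i] * W[i][j] for i in range(d_in)) for j in range(d_out)]


def tiled_matvec(x, W, k):
    """Same product, computed as k partial products over input-feature tiles.

    Assumption for this sketch: d_in is divisible by k, so all tiles are equal.
    """
    d_in, d_out = len(x), len(W[0])
    if len(W) != d_in:
        raise ValueError("W must have one row per input feature")
    if k <= 0 or d_in % k != 0:
        raise ValueError("k must be positive and divide the input dimension")

    tile = d_in // k
    y = [0.0] * d_out
    for t in range(k):
        lo, hi = t * tile, (t + 1) * tile
        x_t = x[lo:hi]   # slice of the token's features for this tile
        W_t = W[lo:hi]   # matching rows of the weight matrix
        # Accumulate this tile's partial product into the output.
        for j in range(d_out):
            y[j] += sum(x_t[i] * W_t[i][j] for i in range(tile))
    return y


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    W = [[1, 0], [0, 1], [1, 1], [2, 0]]
    print(tiled_matvec(x, W, k=2))
    print(dense_matvec(x, W))
```

Each tile touches only a `tile x d_out` slice of the weights at a time, so the working set per step is smaller than for the full matrix, and the partial results are summed into the same output vector.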