Mobile Video Diffusion

Haitam Ben Yahia, Denis Korzhenkov, Ioannis Lelekas, Amir Ghodrati, Amirhossein Habibian

TL;DR

MobileVD addresses the high computational demands of video diffusion by engineering a mobile-ready spatio-temporal UNet derived from Stable Video Diffusion (SVD). The approach combines frame downsampling, temporal multiscaling, channel funnels with CSI initialization, temporal block pruning, and adversarial finetuning that reduces denoising to a single step. Key results: a 523× reduction in compute and latency suitable for on-device use, with latents for a 14-frame clip generated in about 1.7 s on a Xiaomi-14 Pro, while FVD rises modestly from 149 to 171. The work makes on-device video diffusion practical for consumer devices and points toward higher-resolution mobile video generation through further compression and more efficient autoencoders.

Abstract

Video diffusion models have achieved impressive realism and controllability but are limited by high computational demands, restricting their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. Starting from a spatio-temporal UNet from Stable Video Diffusion (SVD), we reduce memory and computational cost by reducing the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schemas to reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce the denoising to a single step. Our model, coined MobileVD, is 523x more efficient (1817.2 vs. 4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents for a 14x512x256 px clip in 1.7 seconds on a Xiaomi-14 Pro. Our results are available at https://qualcomm-ai-research.github.io/mobile-video-diffusion/

Paper Structure

This paper contains 23 sections, 1 theorem, 20 equations, 5 figures, 9 tables.

Key Result

Lemma B.1

For all indices $i$ it holds true that $\gamma_i = 0$.

Figures (5)

  • Figure 1: Quality-efficiency trade-off. Our MobileVD accelerates SVD by $523\times$ (in FLOPs) with a slight decrease in generation quality (in FVD), reaching a better quality vs. efficiency trade-off than alternatives.
  • Figure 2: Effect of optimized cross-attention for a mobile device. We show the number of cycles of the top-4 operations on mobile hardware for an input resolution of $128 \times 128$. Note that removing the no-op similarity map computation in cross-attention layers reduces cycles on softmax operations by roughly 80.0%.
  • Figure 3: Channel funnels. We show an example of channel funnels applied to a couple of layers within the model. At training time, funnels serve as adaptors reducing model width. At inference, they are merged with corresponding weight matrices without loss of quality.
  • Figure 4: Learned pruning of temporal blocks. (a) Each temporal block in the base SVD model is implemented as a residual block w.r.t. its input $x_s$. The output of the temporal layers $r_t$ is summed with the input $x_s$ to give $x_t = x_s + r_t$, which is then averaged with $x_s$ using a learnable weight $\alpha$. By reordering the terms, we derive the effective update rule $\alpha x_s +\left(1 - \alpha\right) x_t = x_s + \left(1 - \alpha\right) r_t$. (b) During training, we introduce a scalar gate $\hat{z} \in \left\lbrace 0, 1 \right\rbrace$ to the residual update rule of each block. We learn importance values $\left\{q_i\right\}_i$ of the temporal blocks, which are transformed into inclusion probabilities $\left\{p_i\right\}_i$ at each training step. Zero-one gate multipliers are sampled according to those probabilities. To enable end-to-end training, we use the straight-through estimator trick. At inference, only the $n$ blocks with the highest importance values are used.
  • Figure 5: Comparison with recent models. We provide the 1st, 6th, 10th and 14th frames from the videos generated with different models. For AnimateLCM [wang2024animatelcm] and SF-V [zhang_svf_2024] we downsampled the released high-resolution videos from [zhang_svf_2024]. For SVD [blattmann_stable_2023] and our MobileVD model, videos were generated at their native resolution, $1024 \times 576$ and $512 \times 256$ respectively.
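The gated residual update described in the Figure 4 caption can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, features are plain lists of floats, and the inclusion probabilities are passed in directly because the caption does not say how importance values are mapped to probabilities. The straight-through estimator needs an autograd framework and so appears only as a comment.

```python
import random

def blended_update(x_s, r_t, alpha):
    """Original form: x_t = x_s + r_t, then alpha * x_s + (1 - alpha) * x_t."""
    x_t = [xs + rt for xs, rt in zip(x_s, r_t)]
    return [alpha * xs + (1.0 - alpha) * xt for xs, xt in zip(x_s, x_t)]

def temporal_block_update(x_s, r_t, alpha, gate=1.0):
    """Reordered form with a zero-one gate: x_s + gate * (1 - alpha) * r_t.

    gate=1.0 keeps the temporal block, gate=0.0 skips it entirely.
    """
    return [xs + gate * (1.0 - alpha) * rt for xs, rt in zip(x_s, r_t)]

def sample_gates(probs, rng):
    """Sample one zero-one gate per block from its inclusion probability.

    In an autograd framework, a common straight-through form is
    p + stop_gradient(z - p): the forward pass uses the sampled gate z,
    while the backward pass differentiates through p (assumption: the
    paper's exact formulation is not given in the caption).
    """
    return [1.0 if rng.random() < p else 0.0 for p in probs]

def top_n_blocks(importances, n):
    """Indices of the n blocks with the highest importance (inference-time choice)."""
    order = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    return sorted(order[:n])

# Toy values, chosen only for illustration.
x_s, r_t, alpha = [1.0, 2.0], [0.5, -1.0], 0.25
same = all(
    abs(a - b) < 1e-12
    for a, b in zip(blended_update(x_s, r_t, alpha),
                    temporal_block_update(x_s, r_t, alpha))
)
gates = sample_gates([0.0, 1.0], random.Random(0))
kept = top_n_blocks([0.1, 0.9, 0.5, 0.7], n=2)
```

With the gate set to zero the block reduces to the identity on $x_s$, which is why a pruned temporal block can simply be dropped from the network at inference.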
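The Figure 3 caption states that channel funnels are merged into the neighboring weight matrices at inference. One way to see why this costs nothing is the purely linear case, sketched below under a strong simplifying assumption: a single funnel matrix `F` is applied directly after a linear layer `W`, with no nonlinearity or normalization in between. The matrices, shapes, and helper names are hypothetical and are not taken from the paper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def matvec(a, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in a]

# Hypothetical layer weight: maps 3 input channels to 4 output channels.
W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, -1.0],
     [2.0, 0.0, 1.0],
     [1.0, 1.0, 1.0]]

# Hypothetical funnel: narrows the 4 output channels down to 2.
F = [[0.5, 0.0, 0.5, 0.0],
     [0.0, 1.0, 0.0, -1.0]]

x = [1.0, 2.0, 3.0]

# Training-time view: the funnel acts as a separate adaptor after the layer.
train_out = matvec(F, matvec(W, x))

# Inference-time view: fold the funnel into the weight once, then use the
# smaller 2 x 3 matrix directly.
W_merged = matmul(F, W)
infer_out = matvec(W_merged, x)
```

Because matrix multiplication is associative, `F (W x)` and `(F W) x` agree, so the merged layer reproduces the training-time output while storing and computing only the narrower weight.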

Theorems & Definitions (2)

  • Lemma B.1
  • proof