USV: Unified Sparsification for Accelerating Video Diffusion Models
Xinjian Wu, Hongmei Wang, Yuan Zhou, Qinglin Lu
TL;DR
USV tackles the scalability bottlenecks of video diffusion models by jointly sparsifying three orthogonal dimensions (attention, token count, and denoising steps) under a learned, entropy-aware policy. Built on FastVideo with Video Sparse Attention and sparse distillation, it adds a token-merging mechanism and a dynamic scheduler that allocates sparsity across layers and timesteps. Empirical results show up to 83.3% speedup in the denoising process and 22.7% end-to-end acceleration, while preserving or even improving perceptual and semantic fidelity. This unified co-design demonstrates a practical path toward efficient, scalable, high-quality video generation.
Abstract
The scalability of high-fidelity video diffusion models (VDMs) is constrained by two key sources of redundancy: the quadratic complexity of global spatio-temporal attention and the computational overhead of long iterative denoising trajectories. Existing accelerators, such as sparse attention and step-distilled samplers, typically target a single dimension in isolation and quickly encounter diminishing returns, as the remaining bottlenecks become dominant. In this work, we introduce USV (Unified Sparsification for Video diffusion models), an end-to-end trainable framework that overcomes this limitation by jointly orchestrating sparsification across both the model's internal computation and its sampling process. USV learns a dynamic, data- and timestep-dependent sparsification policy that prunes redundant attention connections, adaptively merges semantically similar tokens, and reduces denoising steps, treating them not as independent tricks but as coordinated actions within a single optimization objective. This multi-dimensional co-design enables strong mutual reinforcement among previously disjoint acceleration strategies. Extensive experiments on large-scale video generation benchmarks demonstrate that USV achieves up to 83.3% speedup in the denoising process and 22.7% end-to-end acceleration, while maintaining high visual fidelity. Our results highlight unified, dynamic sparsification as a practical path toward efficient, high-quality video generation.
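The diminishing-returns argument above can be made concrete with a toy cost model. The sketch below is illustrative only and is not the cost model used by USV. It assumes that denoising cost is the number of steps times a per-step cost, that attention cost scales with the fraction of attention connections kept and quadratically with the fraction of tokens kept, and that all remaining compute scales linearly with the fraction of tokens kept. The function name `denoising_cost`, its parameters, and every number are hypothetical.

```python
def denoising_cost(steps, attn_share, other_share, attn_keep=1.0, token_keep=1.0):
    """Toy relative cost of a full denoising trajectory (hypothetical model).

    steps       -- number of denoising steps
    attn_share  -- share of per-step baseline cost spent in attention
    other_share -- share of per-step baseline cost spent elsewhere
    attn_keep   -- fraction of attention connections kept (1.0 = dense)
    token_keep  -- fraction of tokens kept after merging (1.0 = no merging)
    """
    # Assumption: attention is quadratic in token count, everything else linear.
    per_step = attn_share * attn_keep * token_keep ** 2 + other_share * token_keep
    return steps * per_step


# Hypothetical baseline: 50 steps, 60% of per-step cost in attention.
baseline = denoising_cost(50, 0.6, 0.4)

# Sparsifying attention alone: even removing it entirely leaves the other 40%.
attn_only = denoising_cost(50, 0.6, 0.4, attn_keep=0.25)
attn_limit = denoising_cost(50, 0.6, 0.4, attn_keep=0.0)

# Sparsifying all three dimensions at once compounds the savings.
joint = denoising_cost(10, 0.6, 0.4, attn_keep=0.25, token_keep=0.5)

for name, cost in [("attention only", attn_only),
                   ("attention limit", attn_limit),
                   ("joint", joint)]:
    print(f"{name}: {baseline / cost:.2f}x relative to baseline")
```

Under these made-up numbers, attention-only sparsification cannot exceed a 2.5x gain because the untouched 40% of per-step cost dominates, whereas reducing steps and tokens as well compounds multiplicatively. Real gains depend on kernel efficiency, memory traffic, and the quality constraints that the learned policy has to respect.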
