USV: Unified Sparsification for Accelerating Video Diffusion Models

Xinjian Wu, Hongmei Wang, Yuan Zhou, Qinglin Lu

TL;DR

USV tackles the scalability bottlenecks of video diffusion models by jointly sparsifying three orthogonal dimensions (attention, token count, and denoising steps) under a learned, entropy-aware policy. Built on FastVideo with Video Sparse Attention and sparse distillation, it adds a token-merging mechanism and a dynamic scheduler that allocates sparsity across layers and timesteps. Reported speedups over Wan2.1-1.3B at 480p reach up to $83.3\times$ in denoising and $22.7\times$ end-to-end, while VBench total, quality, and semantic scores are maintained or slightly improved. This unified co-design demonstrates a practical path toward efficient, scalable, high-quality video generation.

Abstract

The scalability of high-fidelity video diffusion models (VDMs) is constrained by two key sources of redundancy: the quadratic complexity of global spatio-temporal attention and the computational overhead of long iterative denoising trajectories. Existing accelerators -- such as sparse attention and step-distilled samplers -- typically target a single dimension in isolation and quickly encounter diminishing returns, as the remaining bottlenecks become dominant. In this work, we introduce USV (Unified Sparsification for Video diffusion models), an end-to-end trainable framework that overcomes this limitation by jointly orchestrating sparsification across both the model's internal computation and its sampling process. USV learns a dynamic, data- and timestep-dependent sparsification policy that prunes redundant attention connections, adaptively merges semantically similar tokens, and reduces denoising steps, treating them not as independent tricks but as coordinated actions within a single optimization objective. This multi-dimensional co-design enables strong mutual reinforcement among previously disjoint acceleration strategies. Extensive experiments on large-scale video generation benchmarks demonstrate that USV achieves up to $83.3\times$ speedup in the denoising process and $22.7\times$ end-to-end acceleration, while maintaining high visual fidelity. Our results highlight unified, dynamic sparsification as a practical path toward efficient, high-quality video generation.
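
The diminishing-returns argument in the abstract can be illustrated with Amdahl's law. The sketch below is not taken from the paper: the function names are hypothetical, and it assumes that only one stage (for example, DiT denoising) is accelerated while the rest of the pipeline keeps its original runtime.

```python
def end_to_end_speedup(accelerated_fraction: float, stage_speedup: float) -> float:
    """Amdahl's law: overall speedup when only part of the runtime is accelerated.

    accelerated_fraction: share of the original runtime spent in the accelerated stage.
    stage_speedup: speedup factor applied to that stage alone.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must lie in [0, 1]")
    if stage_speedup <= 0.0:
        raise ValueError("stage_speedup must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / stage_speedup)


def implied_fraction(overall_speedup: float, stage_speedup: float) -> float:
    """Invert Amdahl's law: runtime share of the accelerated stage.

    Assumption (ours, not the paper's): everything outside the stage is unchanged.
    """
    if stage_speedup <= 1.0 or not 1.0 <= overall_speedup <= stage_speedup:
        raise ValueError("need 1 <= overall_speedup <= stage_speedup and stage_speedup > 1")
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / stage_speedup)


if __name__ == "__main__":
    # Even an enormous stage speedup is capped by the untouched remainder.
    for s in (2.0, 10.0, 100.0, 1e6):
        print(f"stage speedup {s:>9.0f}x -> overall {end_to_end_speedup(0.9, s):.2f}x")
    # Speedup figures quoted in the TL;DR (denoising and end-to-end).
    print(f"implied denoising share: {implied_fraction(22.7, 83.3):.3f}")
```

Under that assumption, the quoted denoising and end-to-end speedups would correspond to denoising taking roughly 97% of the baseline runtime. This is an inference from the two headline numbers, not a measurement reported in the source.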

Paper Structure

This paper contains 35 sections, 19 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: (a) Speedup of USV over Wan2.1-1.3B at 480p. Log-scale ratios for end-to-end (E2E) generation and DiT denoising time, normalized to Wan as $1\times$. USV achieves over $20\times$ end-to-end and $80\times$ denoising speedup. (b) VBench comparison showing that USV maintains or slightly improves total, quality, and semantic scores compared to the original model.
  • Figure 2: Overview of USV. Left: unified sparse distillation. A sparse-distilled generator is trained to match a full-attention teacher via a distribution-matching gradient from a frozen real score network, while a fake score network is fine-tuned with a diffusion loss. Right: per-block dynamic sparsification. Queries, keys, and values are partitioned into local 3D cubes and passed through a token merging module to aggregate redundant tokens. An entropy-aware dynamic sparsity scheduler then allocates sparsity between local cube attention and long-range sparse attention (via top-$k$ cube selection). The outputs are unmerged, broadcast back to the full token grid, and fused to produce the final block output.
  • Figure 3: Layer-wise attention and entropy of the 3-step sparse-distilled student. Top: head-averaged self-attention maps of layers 0, 16, and 29 at a fixed diffusion step, showing increasingly concentrated near-diagonal patterns in deeper layers. Bottom: layer-wise attention entropy over $20$ validation videos; solid curves denote the mean for each step and shaded bands the min--max range, highlighting consistently low-entropy, highly redundant layers.
  • Figure 4: Qualitative comparison among Wan, FastWan, and our USV. Each row corresponds to a method and each column shows consecutive frames of the same prompt (480p, 81 frames; examples from two scenes). The dense Wan baseline produces stable but relatively softer textures. FastWan reduces runtime but exhibits mild temporal flicker and occasional blur due to static sparsity. In contrast, USV maintains sharper details and smoother motion across time by dynamically allocating computation along layers and timesteps. The visual trends align with the quantitative results in Table \ref{tab:comparison}, showing that unified sparsification substantially accelerates video diffusion without compromising perceptual fidelity.
  • Figure 5: Effect of dynamic sparsification policy. We compare our learned dynamic sparsification schedule (top) with a reversed static variant (bottom), under the same sparsity and compute budget. The reversed policy, which allocates high sparsity to later timesteps, collapses into severe temporal flickering and texture corruption, while our dynamic policy maintains stable structure and fine details across frames. This highlights that adaptively allocating sparsity over timesteps is crucial for robust video diffusion.
  • ...and 1 more figure
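
The captions of Figures 2 and 3 mention attention entropy, an entropy-aware sparsity scheduler, and top-$k$ cube selection. The sketch below shows how these pieces could fit together in plain Python. It is only an illustration: the paper's scheduler is learned, whereas `cubes_to_keep` here is a hypothetical hand-written rule, and all function names are assumptions rather than the authors' code.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def normalized_entropy(probs: List[float]) -> float:
    """Entropy divided by its maximum log(n): near 0 is peaked, near 1 is uniform."""
    n = len(probs)
    return attention_entropy(probs) / math.log(n) if n > 1 else 0.0


def cubes_to_keep(norm_entropy: float, num_cubes: int, min_keep: int = 1) -> int:
    """Hypothetical rule: low-entropy (redundant) attention keeps fewer cubes."""
    k = math.ceil(norm_entropy * num_cubes)
    return max(min_keep, min(num_cubes, k))


def top_k_cubes(cube_scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring key cubes, returned in ascending order."""
    ranked = sorted(range(len(cube_scores)), key=lambda i: cube_scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    cube_logits = [8.0, 0.0, 0.0, 0.0]  # one query cube scored against four key cubes
    probs = softmax(cube_logits)
    k = cubes_to_keep(normalized_entropy(probs), num_cubes=len(probs))
    print("normalized entropy:", round(normalized_entropy(probs), 4))
    print("selected cubes:", top_k_cubes(probs, k))
```

The design choice illustrated is that a sharply peaked attention row (low entropy) can be served by very few key cubes, while a diffuse row needs a larger budget. How USV actually maps entropy to sparsity across layers and timesteps is not specified in the captions above.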