QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification

Weilun Feng, Chuanguang Yang, Haotong Qin, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu

TL;DR

This work addresses the heavy compute and memory demands of diffusion-transformer-based video generation by proposing QuantSparse, a unified framework that couples model quantization with attention sparsification. It introduces two components: Multi-Scale Salient Attention Distillation (MSAD), which aligns attention under quantization through global and local supervision, and Second-Order Sparse Attention Reparameterization (SSAR), which exploits temporally stable second-order residuals together with a cache-based correction via SVD. Empirical results on Wan2.1 and HunyuanVideo models show that QuantSparse achieves up to 3.68× storage reduction and 1.74× to 1.88× speedups while maintaining or even improving PSNR, LPIPS, and perceptual metrics relative to strong baselines. The findings suggest that a carefully designed synergy between quantization and sparse attention can make large-scale video diffusion models practical to deploy without compromising quality.

Abstract

Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment. Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression. Combining them promises compounded efficiency gains, but naive integration is ineffective. The sparsity-induced information loss exacerbates quantization noise, leading to amplified attention shifts. To address this, we propose QuantSparse, a unified framework that integrates model quantization with attention sparsification. Specifically, we introduce Multi-Scale Salient Attention Distillation, which leverages both global structural guidance and local salient supervision to mitigate quantization-induced bias. In addition, we develop Second-Order Sparse Attention Reparameterization, which exploits the temporal stability of second-order residuals to efficiently recover information lost under sparsity. Experiments on HunyuanVideo-13B demonstrate that QuantSparse achieves 20.88 PSNR, substantially outperforming the state-of-the-art quantization baseline Q-VDiT (16.85 PSNR), while simultaneously delivering a 3.68× reduction in storage and 1.88× acceleration in end-to-end inference. Our code will be released in https://github.com/wlfeng0509/QuantSparse.

Paper Structure

This paper contains 30 sections, 4 theorems, 20 equations, 17 figures, 13 tables.

Key Result

Proposition 3.1

Quantization injects noise $\epsilon$ into the QK dot product $\mathbf{Q}\mathbf{K}^\top$, yielding a systematic bias $\delta$. The parallel errors introduced by quantization and sparse attention further lead to a compounded shift.
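
The effect described in Proposition 3.1 can be probed numerically. The sketch below is not the paper's derivation: it assumes Gaussian noise added to the attention logits as a stand-in for quantization noise $\epsilon$, and a per-row top-k mask as a stand-in for sparse attention. The function names (`topk_mask`, `attention_shift`) and all default parameters are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; rows containing -inf entries are handled
    # because exp(-inf) evaluates to 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def topk_mask(logits, density):
    """Boolean mask keeping the ceil(density * n) largest logits in each row."""
    n = logits.shape[-1]
    k = max(1, int(np.ceil(density * n)))
    # After argpartition, the last k positions hold the indices of the k largest values.
    idx = np.argpartition(logits, n - k, axis=-1)[..., n - k:]
    mask = np.zeros_like(logits, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def attention_shift(n_tokens=256, dim=64, noise_std=0.1, density=0.15, seed=0):
    """Mean per-row L1 distance to full-precision attention (illustrative only)."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_tokens, dim))
    k = rng.standard_normal((n_tokens, dim))
    logits = q @ k.T / np.sqrt(dim)

    # Assumption: quantization noise modeled as i.i.d. Gaussian noise on the logits.
    noisy = logits + noise_std * rng.standard_normal(logits.shape)

    ref = softmax(logits)                       # full-precision attention
    quant = softmax(noisy)                      # noise only
    mask = topk_mask(noisy, density)            # sparsity pattern chosen from noisy logits
    sparse = softmax(np.where(mask, noisy, -np.inf))  # noise plus sparsity

    def l1(a):
        return float(np.abs(a - ref).sum(axis=-1).mean())

    return {
        "quant_only": l1(quant),
        "quant_plus_sparse": l1(sparse),
        "kept_per_row": int(mask.sum(axis=-1)[0]),
    }

if __name__ == "__main__":
    print(attention_shift())
```

Comparing `quant_only` with `quant_plus_sparse` for different `noise_std` and `density` values gives a rough feel for how the two error sources interact on random inputs; it says nothing about the magnitudes observed in real video diffusion transformers.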

Figures (17)

  • Figure 1: QuantSparse effectively quantizes Wan2.1-14B \cite{wan2025wan} and HunyuanVideo \cite{kong2024hunyuanvideo} to W4A8 with 15% attention density without compromising visual quality.
  • Figure 2: Overview of the proposed QuantSparse. Left: Attention distillation for robust alignment during calibration. Right: Efficient and accurate attention approximation during inference.
  • Figure 3: The motivation and effect of Multi-Scale Salient Attention Distillation. (a): Token saliency distribution of Wan2.1-1.3B \cite{wan2025wan}, block 19, head 1. Fewer than 10% of tokens are salient. (b)(c): Visualization of the attention difference between the quantized model and the FP model. (d): Memory consumption of different attention distillation methods.
  • Figure 4: The motivation and effect of Second-Order Sparse Attention Reparameterization. The results are from HunyuanVideo-13B \cite{kong2024hunyuanvideo}, single_transformer_block.10, under W4A8. We provide more visualization and analysis in Appendix Sec. \ref{sec:more_ssar}.
  • Figure 5: Visual comparison on Wan2.1-14B under the W4A8 quantization setting. We uniformly sample two frames for visualization. '(xx%)' denotes the attention density.
  • ...and 12 more figures
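
The captions above use two terms, W4A8 and attention density, without defining them. As a rough illustration, W4A8 conventionally means 4-bit weights with 8-bit activations, and attention density is read here as the fraction of query-key pairs that are actually computed. The sketch below assumes symmetric per-tensor uniform quantization, which is a common simple scheme and not necessarily the quantizer used in the paper; `fake_quantize`, `w4a8_linear`, and `attention_density` are hypothetical names.

```python
import numpy as np

def fake_quantize(x, bits):
    """Symmetric per-tensor uniform quantization followed by dequantization.

    Assumption: one scale per tensor and integer levels in [-qmax, qmax].
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.abs(x).max())
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

def w4a8_linear(x, w):
    """Linear layer with 8-bit activations x (batch, in) and 4-bit weights w (out, in)."""
    return fake_quantize(x, 8) @ fake_quantize(w, 4).T

def attention_density(mask):
    """Fraction of query-key pairs kept by a boolean attention mask."""
    return float(np.asarray(mask, dtype=bool).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((2, 8))
    w = rng.standard_normal((3, 8))
    err = np.abs(w4a8_linear(x, w) - x @ w.T).max()
    print("max abs output error:", err)
    print("density of a 4x4 causal mask:", attention_density(np.tril(np.ones((4, 4), dtype=bool))))
```

Under this reading, "15% attention density" means roughly 15 out of every 100 attention entries are evaluated and the rest are skipped.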

Theorems & Definitions (4)

  • Proposition 3.1
  • Proposition 3.2
  • Proposition 3.3
  • Theorem 3.4