S$^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation

Weilun Feng, Haotong Qin, Chuanguang Yang, Xiangqi Li, Han Yang, Yuqi Li, Zhulin An, Libo Huang, Michele Magno, Yongjun Xu

TL;DR

Video diffusion transformers achieve high-quality synthesis but are hard to deploy because of their very long token sequences and large parameter counts. The paper introduces S$^2$Q-VDiT, a post-training quantization (PTQ) framework that combines Hessian-aware Salient Data Selection with Attention-guided Sparse Token Distillation to cut memory and compute while preserving quality. With 4-bit weights and 6-bit activations (W4A6), it reports near-lossless quality together with $3.9\times$ model compression and $1.3\times$ inference acceleration on large V-DMs. The key ideas are to select calibration data with a salience metric that accounts for both diffusion and quantization characteristics, and to reweight the quantization loss by token importance derived from sparse attention maps. Across models from 2B to 13B parameters, S$^2$Q-VDiT consistently outperforms existing PTQ baselines, making state-of-the-art video diffusion models more practical to deploy.

Abstract

Diffusion transformers have emerged as the mainstream paradigm for video generation models. However, the use of up to billions of parameters incurs significant computational costs. Quantization offers a promising solution by reducing memory usage and accelerating inference. Nonetheless, we observe that the joint modeling of spatial and temporal information in video diffusion models (V-DMs) leads to extremely long token sequences, which introduces high calibration variance and learning challenges. To address these issues, we propose S$^2$Q-VDiT, a post-training quantization framework for V-DMs that leverages Salient data and Sparse token distillation. During the calibration phase, we identify that quantization performance is highly sensitive to the choice of calibration data. To mitigate this, we introduce Hessian-aware Salient Data Selection, which constructs high-quality calibration datasets by considering both diffusion and quantization characteristics unique to V-DMs. To tackle the learning challenges, we further analyze the sparse attention patterns inherent in V-DMs. Based on this observation, we propose Attention-guided Sparse Token Distillation, which exploits token-wise attention distributions to emphasize tokens that are more influential to the model's output. Under W4A6 quantization, S$^2$Q-VDiT achieves lossless performance while delivering $3.9\times$ model compression and $1.3\times$ inference acceleration. Code will be available at https://github.com/wlfeng0509/s2q-vdit.
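
To make the W4A6 setting concrete, the following is a minimal sketch of simulated ("fake") quantization for one linear layer: weights are rounded to 4 bits and activations to 6 bits, then mapped back to floating point so the error can be measured. It is an illustration only, not the paper's quantizer. The function `fake_quantize`, the uniform asymmetric min/max scheme, the per-channel and per-token granularity, and the random tensors are all assumptions made for this example.

```python
import numpy as np

def fake_quantize(x, n_bits, axis=None):
    """Uniform asymmetric quantize-dequantize along `axis` (hypothetical helper)."""
    x = np.asarray(x, dtype=np.float64)
    levels = 2 ** n_bits - 1                      # e.g. 15 steps for 4 bits
    x_min = x.min(axis=axis, keepdims=True)
    x_max = x.max(axis=axis, keepdims=True)
    scale = np.maximum(x_max - x_min, 1e-12) / levels
    q = np.clip(np.round((x - x_min) / scale), 0, levels)  # integer grid
    return q * scale + x_min                      # back to floating point

rng = np.random.default_rng(0)
weight = rng.normal(size=(64, 128))   # (out_features, in_features), random stand-in
tokens = rng.normal(size=(16, 128))   # (num_tokens, in_features), random stand-in

w4 = fake_quantize(weight, n_bits=4, axis=1)  # one scale per output channel
a6 = fake_quantize(tokens, n_bits=6, axis=1)  # one scale per token

full = tokens @ weight.T              # full-precision layer output
quant = a6 @ w4.T                     # simulated W4A6 layer output
rel_err = np.linalg.norm(full - quant) / np.linalg.norm(full)
print(f"relative output error under simulated W4A6: {rel_err:.4f}")
```

Post-training quantization methods typically calibrate the quantizer parameters on a small dataset so that this output error stays small, which is why the choice of calibration data matters.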

Paper Structure

This paper contains 29 sections, 11 equations, 17 figures, 9 tables.

Figures (17)

  • Figure 1: We present $\text{S}^2$Q-VDiT, a post-training quantization method for video diffusion transformers. We quantize HunyuanVideo (Kong et al., 2024) to 4-bit weights and 6-bit activations without compromising visual quality. $\text{S}^2$Q-VDiT can further achieve $3.9\times$ model compression and $1.3\times$ inference acceleration.
  • Figure 2: Overview of $\text{S}^2$Q-VDiT. The framework includes Hessian-aware Salient Data Selection (SDS) for constructing the calibration dataset and Attention-guided Sparse Token Distillation (STD) for block-wise optimization.
  • Figure 3: Visualization of results obtained with different calibration data on CogVideoX-2B. We compare our proposed method with All Timesteps from One Prompt (ATOP), All Timesteps from Five Prompts (ATFP), and Random Timesteps from Five Prompts (RTFP). Our method yields better generation quality.
  • Figure 4: Visualization of sparse attention patterns in CogVideoX-2B block-10. In the attention heatmaps, only a few columns have significantly higher weights. In the token-wise attention distribution, only 10% of tokens have larger attention weights.
  • Figure 5: Visual comparison on different models under the W4A6 quantization setting.
  • ...and 12 more figures
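
The sparse attention pattern described for Figure 4, where a few columns of the attention map carry most of the weight, suggests one simple way to weight a distillation loss by token importance. The sketch below is a hedged illustration of that idea, not the paper's formulation: taking importance as the average attention each key token receives, and the helpers `token_importance` and `weighted_distill_loss`, are assumptions, and all tensors are synthetic.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_importance(attn):
    """attn: (heads, queries, keys) with rows summing to 1.
    Returns one weight per key token, normalized to sum to 1 (assumed metric)."""
    received = attn.mean(axis=(0, 1))   # average attention each key column receives
    return received / received.sum()

def weighted_distill_loss(fp_out, q_out, importance):
    """Per-token squared error between full-precision and quantized outputs,
    weighted by token importance (hypothetical loss for illustration)."""
    per_token = ((fp_out - q_out) ** 2).mean(axis=-1)   # shape: (tokens,)
    return float((importance * per_token).sum())

rng = np.random.default_rng(0)
heads, n_tokens, dim = 4, 32, 16
logits = rng.normal(size=(heads, n_tokens, n_tokens))
logits[:, :, :3] += 4.0                 # synthetic: three key tokens dominate
attn = softmax(logits)
imp = token_importance(attn)

fp_out = rng.normal(size=(n_tokens, dim))                 # stand-in block output
q_out = fp_out + 0.1 * rng.normal(size=(n_tokens, dim))   # stand-in quantized output

loss_weighted = weighted_distill_loss(fp_out, q_out, imp)
loss_uniform = weighted_distill_loss(fp_out, q_out, np.full(n_tokens, 1.0 / n_tokens))
print("top-3 tokens by importance:", sorted(np.argsort(imp)[-3:].tolist()))
print("weighted loss:", loss_weighted, "uniform loss:", loss_uniform)
```

With uniform weights the loss reduces to a plain mean squared error over tokens; the importance weights shift the optimization toward errors on the few heavily attended tokens.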