Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers

Weilun Feng, Chuanguang Yang, Haotong Qin, Xiangqi Li, Yu Wang, Zhulin An, Libo Huang, Boyu Diao, Zixiang Zhao, Yongjun Xu, Michele Magno

TL;DR

Q-VDiT addresses the quantization of video-generation diffusion transformers by introducing the Token-aware Quantization Estimator (TQE) and Temporal Maintenance Distillation (TMD). TQE compensates for quantization errors across the token and feature dimensions, while TMD aligns inter-frame temporal relationships, using a KL divergence between frame distributions, to preserve video coherence. The approach yields state-of-the-art scene consistency and multi-aspect video quality under low-bit settings (e.g., $23.40$ scene consistency at W3A6), along with substantial memory and runtime savings. This enables more efficient deployment of video diffusion transformers on edge devices without substantial degradation in perceptual video quality.

Abstract

Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token-aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce Temporal Maintenance Distillation (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency of 23.40, setting a new benchmark and outperforming current state-of-the-art quantization methods by 1.9$\times$. Code will be available at https://github.com/cantbebetter2/Q-VDiT.
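To make the idea of aligning inter-frame relationships with a KL divergence more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: this page does not define the "frame distributions", so the sketch assumes each frame's distribution is a softmax over its cosine similarities to all frames of the same video. The function names (`frame_distributions`, `temporal_kl`), the `temperature` parameter, and the toy feature vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def frame_distributions(frames, temperature=1.0):
    """For each frame, softmax over its similarity to every frame (assumed form)."""
    dists = []
    for f_i in frames:
        logits = [cosine(f_i, f_j) / temperature for f_j in frames]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        dists.append([e / z for e in exps])
    return dists

def temporal_kl(fp_frames, q_frames, temperature=1.0):
    """Mean KL(P_fp || P_q) over frames, comparing inter-frame distributions."""
    p = frame_distributions(fp_frames, temperature)
    q = frame_distributions(q_frames, temperature)
    total = 0.0
    for p_i, q_i in zip(p, q):
        total += sum(a * math.log(a / b) for a, b in zip(p_i, q_i))
    return total / len(p)

# Toy per-frame features: full-precision model vs. a quantized model (made up).
fp_feats = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.3], [0.7, 0.4, 0.5]]
q_feats = [[1.0, 0.1, 0.1], [0.7, 0.3, 0.3], [0.9, 0.2, 0.6]]

loss = temporal_kl(fp_feats, q_feats)
```

Under these assumptions the loss is zero when the quantized model reproduces the full-precision inter-frame similarity structure and grows as that structure drifts, which is the property a temporal distillation objective would rely on.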

Paper Structure

This paper contains 25 sections, 3 theorems, 24 equations, 15 figures, 7 tables.

Key Result

Proposition 3.1

Given an $L$-layer model $f(\{\mathbf{W}_i\}_{i=1}^L)$, quantizing the weights is equivalent to applying a perturbation $\Delta$ to the original weights, where $\mathbf{W}_i$ denotes the weight of the $i$-th layer.
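The perturbation view can be illustrated with a small worked example. The sketch below assumes symmetric uniform quantization with a single per-tensor scale, which is a common baseline and not necessarily the quantizer used in the paper. The function name `quantize_uniform` and the toy weights are hypothetical. It computes the quantized weights and recovers the perturbation as the difference between quantized and original values.

```python
def quantize_uniform(weights, bits):
    """Symmetric uniform fake-quantization of a list of floats (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 3 for 3-bit signed integers
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    dequantized = []
    for w in weights:
        q = round(w / scale)                # nearest integer level
        q = max(-qmax - 1, min(qmax, q))    # clip to the signed integer range
        dequantized.append(q * scale)       # map back to the real-valued grid
    return dequantized, scale

# Toy weights for one layer (made-up values).
weights = [0.42, -1.30, 0.07, 0.88, -0.55]

w_hat, scale = quantize_uniform(weights, bits=3)

# The perturbation is whatever quantization added to each original weight.
delta = [q - w for q, w in zip(w_hat, weights)]
```

By construction, adding `delta` back to `weights` reproduces `w_hat`, and with this scheme each perturbation entry is bounded by half a quantization step (`scale / 2`), since no weight is clipped.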

Figures (15)

  • Figure 1: Evaluation on VBench of different quantization methods under W3A6 setting.
  • Figure 2: Overview of proposed Q-VDiT. The framework includes Token-aware Quantization Estimator (TQE) for forward process and Temporal Maintenance Distillation (TMD) for optimization. The middle part denotes the quantized forward process. $\otimes$ denotes matrix multiplication, $\odot$ denotes token-wise multiplication.
  • Figure 3: An illustration of TQE in Q-VDiT.
  • Figure 4: An illustration of TMD in Q-VDiT. The upper-left and lower-right corners are additionally enlarged.
  • Figure 5: Visualization of different frames in a single video.
  • ...and 10 more figures

Theorems & Definitions (3)

  • Proposition 3.1
  • Theorem 3.2
  • Lemma 1.1