VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers

Juncan Deng; Shuaiting Li; Zeyu Wang; Hong Gu; Kedong Xu; Kejie Huang

TL;DR

This work tackles the deployment bottleneck of Diffusion Transformer Models (DiTs) on edge devices by introducing VQ4DiT, a post-training vector quantization framework that jointly optimizes a codebook $C$ and weight assignments $A$. The method forms a candidate assignment set $A_c$ for each weight sub-vector and reconstructs the weights as a weighted average over the candidates. It then uses a zero-data, block-wise calibration objective $\mathcal{L} = \lambda_d \mathcal{L}_d + \lambda_r \mathcal{L}_r$ to update $C$ and the assignment ratios, selecting the optimal assignments once $\mathcal{L}_r$ is small. Experiments on the DiT XL/2 model show that 2-bit quantization with VQ4DiT stays close to full-precision (FP) quality on FID, sFID, IS, and Precision, while strong baselines fail at such low bit-widths; a CUDA kernel further reduces inference time by about one-third. These results make high-resolution diffusion-based image generation practical on devices with limited memory and compute.
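To make the "2-bit" figure concrete, the following is a minimal sketch of the standard index-cost arithmetic for vector quantization: each sub-vector of $d$ weights stores one index into a codebook of $k$ codewords, so the assignments cost $\log_2(k)/d$ bits per weight. The function names and the $(k, d)$ and layer-size values are hypothetical illustrations, not settings taken from the paper.

```python
import math

def vq_bits_per_weight(num_codewords: int, subvector_dim: int) -> float:
    """Bits per weight spent on assignments: one log2(k)-bit index per d weights."""
    return math.log2(num_codewords) / subvector_dim

def codebook_overhead_bits(num_codewords: int, subvector_dim: int,
                           num_weights: int, codeword_bits: int = 16) -> float:
    """Extra bits per weight needed to store the codebook itself.

    Assumes (hypothetically) that each codeword entry is kept in 16-bit floats.
    """
    return num_codewords * subvector_dim * codeword_bits / num_weights

# Hypothetical settings, chosen only to illustrate the arithmetic.
two_bit = vq_bits_per_weight(num_codewords=256, subvector_dim=4)
three_bit = vq_bits_per_weight(num_codewords=64, subvector_dim=2)
overhead = codebook_overhead_bits(256, 4, num_weights=1024 * 1024)

print(two_bit, three_bit, overhead)
```

Under these assumed values, the index cost dominates and the codebook adds only a small fraction of a bit per weight for a layer of this size. A larger codebook would lower the quantization error but raise both terms.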

Abstract

Diffusion Transformer Models (DiTs) have moved the network architecture from traditional UNets to transformers, demonstrating exceptional capabilities in image generation. Although DiTs have been widely applied to high-definition video generation tasks, their large parameter size hinders inference on edge devices. Vector quantization (VQ) can decompose model weights into a codebook and assignments, allowing extreme weight quantization and significantly reducing memory usage. In this paper, we propose VQ4DiT, a fast post-training vector quantization method for DiTs. We find that traditional VQ methods calibrate only the codebook without calibrating the assignments. As a result, weight sub-vectors are incorrectly given the same assignment, which provides inconsistent gradients to the codebook and leads to a suboptimal result. To address this challenge, VQ4DiT calculates a candidate assignment set for each weight sub-vector based on Euclidean distance and reconstructs the sub-vector as a weighted average over the candidates. Then, using a zero-data and block-wise calibration method, the optimal assignment from the set is efficiently selected while the codebook is calibrated. VQ4DiT quantizes a DiT XL/2 model on a single NVIDIA A100 GPU in 20 minutes to 5 hours, depending on the quantization settings. Experiments show that VQ4DiT establishes a new state of the art in the trade-off between model size and performance, quantizing weights to 2-bit precision while retaining acceptable image generation quality.
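The candidate-set idea described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the codebook here is random rather than learned, the ratios are parameterized as a softmax over hypothetical `logits`, and the names `init_candidates` and `reconstruct` are invented for this example. The calibration step that would update the codebook and ratios is omitted.

```python
import numpy as np

def init_candidates(weight, codebook, n):
    """For each sub-vector, return the indices of its n nearest codewords."""
    d = codebook.shape[1]
    sub = weight.reshape(-1, d)                      # (num_subvectors, d)
    # Squared Euclidean distance between every sub-vector and every codeword.
    dist = ((sub[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    cand = np.argsort(dist, axis=1)[:, :n]           # nearest first
    return sub, cand

def reconstruct(codebook, cand, logits):
    """Weighted average of candidate codewords.

    Assumption: ratios come from a softmax over per-candidate logits.
    """
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    ratios = z / z.sum(axis=1, keepdims=True)        # (num_subvectors, n)
    soft = (ratios[:, :, None] * codebook[cand]).sum(axis=1)
    return soft, ratios

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))     # toy weight matrix
C = rng.standard_normal((16, 4))     # hypothetical codebook: k=16, d=4

sub, cand = init_candidates(W, C, n=2)
logits = np.zeros(cand.shape)        # uniform ratios before any calibration
W_soft, ratios = reconstruct(C, cand, logits)

# After calibration, keep the candidate with the highest ratio per sub-vector.
final = cand[np.arange(len(cand)), ratios.argmax(axis=1)]
W_hard = C[final].reshape(W.shape)
```

With uniform ratios, the final selection falls back to the nearest codeword, which matches plain VQ. The point of calibrating the ratios is that a different candidate in the set may end up with the highest ratio.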

Paper Structure (18 sections, 11 equations, 6 figures, 6 tables, 1 algorithm)

Introduction
Backgrounds and Related Works
Diffusion Transformer Models
Model Quantization
Challenges of Vector Quantization for DiTs
Trade-off of codebook size
Setups of codebooks and assignments
VQ4DiT
Initialization of Codebooks and Candidate Assignment Sets
Zero-data and Block-wise Calibration
Experiments
Experimental Settings
Main Results
Ablation Study
Conclusion
...and 3 more sections

Figures (6)

Figure 1: The pipeline of VQ4DiT. (A) DiT blocks. (B) DiT blocks are quantized by vector quantization (VQ). (C) Candidate assignments and codebooks are calibrated by zero-data and block-wise calibration to ultimately obtain the optimal assignments with the highest ratios.
Figure 2: Images generated by VQ4DiT and three strong baselines, RepQ-ViT [li2023repq], Q-DiT [chen2024q], and GPTQ [frantar2022gptq], with 3-bit and 2-bit quantization on ImageNet 256$\times$256. Our VQ4DiT model is capable of generating high-quality images even at extremely low bit-widths.
Figure 3: Cosine similarity of the gradients of sub-vectors that share the same assignment, with and without assignment calibration.
Figure 4: Distribution of the position of the optimal assignment within candidate assignment sets of different lengths $n$. (A) $n=2$. (B) $n=3$. (C) $n=4$.
Figure 5: Images generated by VQ4DiT with 3-bit and 2-bit quantization on ImageNet 256$\times$256.
...and 1 more figure
