VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers
Juncan Deng, Shuaiting Li, Zeyu Wang, Hong Gu, Kedong Xu, Kejie Huang
TL;DR
This work tackles the deployment bottleneck of Diffusion Transformer Models (DiTs) on edge devices by introducing VQ4DiT, a post-training vector quantization framework that jointly optimizes a codebook $C$ and weight assignments $A$. The method forms a candidate assignment set $A_c$ for each weight sub-vector and reconstructs the sub-vector as a weighted average of its candidates. It then uses a zero-data, block-wise calibration objective $\mathcal{L} = \lambda_d \mathcal{L}_d + \lambda_r \mathcal{L}_r$ to update $C$ and the assignment ratios, selecting the optimal assignments when $\mathcal{L}_r$ is small. Experiments on the DiT XL/2 model show that 2-bit quantization with VQ4DiT maintains near-full-precision (FP) quality across FID, sFID, IS, and Precision, while strong baselines fail at such low bit-widths; a CUDA kernel further reduces inference time by about one-third. These results enable practical, high-resolution diffusion-based image generation on devices with limited memory and compute.
Abstract
Diffusion Transformer Models (DiTs) have shifted the network architecture from traditional UNets to transformers, demonstrating exceptional capabilities in image generation. Although DiTs have been widely applied to high-definition video generation tasks, their large parameter size hinders inference on edge devices. Vector quantization (VQ) can decompose model weights into a codebook and assignments, allowing extreme weight quantization and significantly reducing memory usage. In this paper, we propose VQ4DiT, a fast post-training vector quantization method for DiTs. We find that traditional VQ methods calibrate only the codebook without calibrating the assignments. As a result, weight sub-vectors can be incorrectly mapped to the same assignment, which provides inconsistent gradients to the codebook and leads to a suboptimal result. To address this challenge, VQ4DiT computes a candidate assignment set for each weight sub-vector based on Euclidean distance and reconstructs the sub-vector as a weighted average of the candidates. Then, using a zero-data, block-wise calibration method, the optimal assignment is efficiently selected from the set while the codebook is calibrated. VQ4DiT quantizes a DiT XL/2 model on a single NVIDIA A100 GPU in 20 minutes to 5 hours, depending on the quantization settings. Experiments show that VQ4DiT establishes a new state of the art in the trade-off between model size and performance, quantizing weights to 2-bit precision while retaining acceptable image generation quality.
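The candidate-set idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the sub-vector length `d = 4`, the codebook size `k = 256`, the candidate count `n = 2`, the random codebook initialization, and the uniform starting ratios are all assumptions chosen for illustration, and the calibration step that updates the codebook and ratios is omitted.

```python
import numpy as np

def candidate_assignments(weight, codebook, d, n):
    """Split `weight` into sub-vectors of length d and return, for each,
    the indices of its n nearest codewords (Euclidean distance)."""
    sub = weight.reshape(-1, d)                                   # (num_sub, d)
    # Squared Euclidean distance between every sub-vector and every codeword.
    dist = ((sub[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    cand = np.argsort(dist, axis=1)[:, :n]                        # (num_sub, n)
    return sub, cand

def reconstruct(codebook, cand, ratios, shape):
    """Rebuild the weight as a ratio-weighted average of candidate codewords."""
    rec = (ratios[:, :, None] * codebook[cand]).sum(axis=1)       # (num_sub, d)
    return rec.reshape(shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)  # toy weight matrix
d, k, n = 4, 256, 2                                    # hypothetical settings

# Hypothetical initialization: pick k existing sub-vectors as codewords.
all_sub = W.reshape(-1, d)
codebook = all_sub[rng.choice(len(all_sub), size=k, replace=False)].copy()

sub, cand = candidate_assignments(W, codebook, d, n)

# Soft reconstruction with uniform ratios (an assumed starting point).
ratios = np.full(cand.shape, 1.0 / n, dtype=np.float32)
W_soft = reconstruct(codebook, cand, ratios, W.shape)

# Hard selection: put all weight on the nearest candidate.
hard = np.zeros_like(ratios)
hard[:, 0] = 1.0
W_hard = reconstruct(codebook, cand, hard, W.shape)

# Index cost per weight element, ignoring codebook storage.
bits_per_weight = np.log2(k) / d
```

With these assumed settings each group of 4 weights is stored as one 8-bit index, which is how a codebook of 256 entries corresponds to 2 bits per weight before accounting for the codebook itself.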
