VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers

Juncan Deng, Shuaiting Li, Zeyu Wang, Hong Gu, Kedong Xu, Kejie Huang

TL;DR

This work tackles the deployment bottleneck of Diffusion Transformer Models (DiTs) on edge devices by introducing VQ4DiT, a post-training vector quantization framework that jointly optimizes a codebook $C$ and weight assignments $A$. The method forms candidate assignment sets $A_c$ for each sub-vector and reconstructs weights via a weighted average, then uses a zero-data and block-wise calibration objective $\mathcal{L} = \lambda_d \mathcal{L}_d + \lambda_r \mathcal{L}_r$ to update $C$ and assignment ratios, selecting optimal assignments when $\mathcal{L}_r$ is small. Experiments on the DiT XL/2 model show that 2-bit quantization achieved by VQ4DiT maintains near-FP quality across FID, sFID, IS, and Precision, while outperforming strong baselines that fail at such low bit-widths; a CUDA kernel further reduces inference time by about one-third. These results enable practical, high-resolution diffusion-based image generation on devices with limited memory and compute.
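The candidate-set idea in the summary above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names, the toy codebook, and the use of a softmax over per-candidate logits to produce the ratios are all assumptions made for the example. The summary itself only states that sub-vectors are reconstructed by a weighted average over a candidate set and that assignment ratios are calibrated.

```python
import math

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def candidate_assignments(sub_vector, codebook, n):
    """Indices of the n codewords closest to sub_vector (Euclidean distance)."""
    order = sorted(range(len(codebook)),
                   key=lambda k: squared_distance(sub_vector, codebook[k]))
    return order[:n]

def softmax(logits):
    """Turn per-candidate logits into ratios that sum to 1 (assumed parameterization)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reconstruct(codebook, candidates, ratios):
    """Weighted average of the candidate codewords."""
    dim = len(codebook[0])
    return [sum(r * codebook[k][j] for k, r in zip(candidates, ratios))
            for j in range(dim)]

def select_assignment(candidates, ratios):
    """Keep the candidate with the highest ratio as the final assignment."""
    best = max(range(len(candidates)), key=lambda i: ratios[i])
    return candidates[best]

# Toy codebook with four 2-dimensional codewords (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sub_vector = [0.9, 0.2]

candidates = candidate_assignments(sub_vector, codebook, n=2)
ratios = softmax([0.0, 0.0])          # equal ratios before any calibration
approx = reconstruct(codebook, candidates, ratios)

# Pretend calibration pushed the logits towards the first candidate.
calibrated_ratios = softmax([2.0, -2.0])
final_index = select_assignment(candidates, calibrated_ratios)
```

The calibration loop that would actually update the codebook and the logits against $\mathcal{L} = \lambda_d \mathcal{L}_d + \lambda_r \mathcal{L}_r$ is omitted here, since the summary does not define the two loss terms.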

Abstract

Diffusion Transformer models (DiTs) have shifted the network architecture from traditional UNets to transformers, demonstrating exceptional capabilities in image generation. Although DiTs have been widely applied to high-definition video generation tasks, their large parameter size hinders inference on edge devices. Vector quantization (VQ) can decompose model weights into a codebook and assignments, allowing extreme weight quantization and significantly reducing memory usage. In this paper, we propose VQ4DiT, a fast post-training vector quantization method for DiTs. We found that traditional VQ methods calibrate only the codebook without calibrating the assignments. As a result, weight sub-vectors that should be separated end up sharing the same assignment, which provides inconsistent gradients to the codebook and leads to a suboptimal result. To address this challenge, VQ4DiT calculates a candidate assignment set for each weight sub-vector based on Euclidean distance and reconstructs the sub-vector as a weighted average of the candidates. Then, using a zero-data and block-wise calibration method, the optimal assignment from the set is efficiently selected while the codebook is calibrated. VQ4DiT quantizes a DiT XL/2 model on a single NVIDIA A100 GPU in 20 minutes to 5 hours, depending on the quantization settings. Experiments show that VQ4DiT establishes a new state of the art in the trade-off between model size and performance, quantizing weights to 2-bit precision while retaining acceptable image generation quality.
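To make the memory argument concrete, the following sketch computes the average storage cost per weight when a matrix is stored as a codebook plus assignments. The layer shape, codebook size, sub-vector dimension, 16-bit codebook entries, and the choice of one codebook per layer are illustrative assumptions, not settings reported in the abstract.

```python
import math

def vq_bits_per_weight(num_weights, codebook_size, sub_vector_dim, codebook_bits=16):
    """Average bits stored per weight for a vector-quantized weight matrix.

    Assumes num_weights is divisible by sub_vector_dim and that the layer
    owns its codebook (both are assumptions of this sketch).
    """
    num_sub_vectors = num_weights // sub_vector_dim
    index_bits = math.ceil(math.log2(codebook_size))   # bits per assignment
    assignment_storage = num_sub_vectors * index_bits
    codebook_storage = codebook_size * sub_vector_dim * codebook_bits
    return (assignment_storage + codebook_storage) / num_weights

# Hypothetical layer: 1152 x 1152 weights, 256 codewords of dimension 4.
bits = vq_bits_per_weight(1152 * 1152, codebook_size=256, sub_vector_dim=4)
print(f"{bits:.3f} bits per weight")
```

With these numbers each 4-element sub-vector costs one 8-bit index, i.e. 2 bits per weight, and the codebook adds only a small overhead on top. Shrinking the codebook or lengthening the sub-vectors lowers the bit-width further, at the cost of a coarser approximation.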

Paper Structure (18 sections, 11 equations, 6 figures, 6 tables, 1 algorithm)

Figures (6)

  • Figure 1: The pipeline of VQ4DiT. (A) DiT blocks. (B) DiT blocks are quantized by vector quantization (VQ). (C) Candidate assignments and codebooks are calibrated by zero-data and block-wise calibration to ultimately obtain the optimal assignments with the highest ratios.
  • Figure 2: Images generated by VQ4DiT and three strong baselines: RepQ-ViT (Li et al., 2023), Q-DiT (Chen et al., 2024), and GPTQ (Frantar et al., 2022), with 3-bit and 2-bit quantization on ImageNet 256$\times$256. Our VQ4DiT model is capable of generating high-quality images even at extremely low bit-width.
  • Figure 3: Cosine similarity of the gradients of sub-vectors that share the same assignment, with and without assignment calibration.
  • Figure 4: Distribution of the position of the optimal assignment within candidate assignment sets of different lengths $n$. (A) $n=2$. (B) $n=3$. (C) $n=4$.
  • Figure 5: Images generated by VQ4DiT with 3-bit and 2-bit quantization on ImageNet 256$\times$256.
  • ...and 1 more figure
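The quantity plotted in Figure 3 can be illustrated with a small sketch. When several sub-vectors share one codeword, the gradient reaching that codeword is the sum of their individual gradients (by the chain rule), so gradients pointing in different directions partly cancel. The helper names and gradient values below are hypothetical and chosen only to show the effect.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def codeword_gradient(sub_vector_grads):
    """Gradient reaching a shared codeword: the sum over its assigned sub-vectors."""
    dim = len(sub_vector_grads[0])
    return [sum(g[j] for g in sub_vector_grads) for j in range(dim)]

# Two sub-vectors assigned to the same codeword, in two hypothetical situations.
aligned = [[1.0, 0.0], [1.0, 0.0]]        # gradients agree
conflicting = [[1.0, 0.0], [-1.0, 0.0]]   # gradients oppose each other

sim_aligned = cosine_similarity(*aligned)
sim_conflicting = cosine_similarity(*conflicting)
grad_aligned = codeword_gradient(aligned)
grad_conflicting = codeword_gradient(conflicting)
```

In the aligned case the codeword receives a clear update direction; in the conflicting case the summed gradient vanishes, which is the kind of inconsistency the abstract attributes to uncalibrated assignments.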