Table of Contents
Fetching ...

PTQ4DiT: Post-training Quantization for Diffusion Transformers

Junyi Wu, Haoxuan Wang, Yuzhang Shang, Mubarak Shah, Yan Yan

TL;DR

This paper proposes PTQ4DiT, a specifically designed PTQ method for DiTs that successfully quantizes DiTs to 8-bit precision (W8A8) while preserving comparable generation ability and further enables effective quantization to 4-bit weight precision (W4A8) for the first time.

Abstract

The recent introduction of Diffusion Transformers (DiTs) has demonstrated exceptional capabilities in image generation by using a different backbone architecture, departing from traditional U-Nets and embracing the scalable nature of transformers. Despite their advanced capabilities, the wide deployment of DiTs, particularly for real-time applications, is currently hampered by considerable computational demands at the inference stage. Post-training Quantization (PTQ) has emerged as a fast and data-efficient solution that can significantly reduce computation and memory footprint by using low-bit weights and activations. However, its applicability to DiTs has not yet been explored and faces non-trivial difficulties due to the unique design of DiTs. In this paper, we propose PTQ4DiT, a specifically designed PTQ method for DiTs. We discover two primary quantization challenges inherent in DiTs, notably the presence of salient channels with extreme magnitudes and the temporal variability in distributions of salient activation over multiple timesteps. To tackle these challenges, we propose Channel-wise Salience Balancing (CSB) and Spearmen's $ρ$-guided Salience Calibration (SSC). CSB leverages the complementarity property of channel magnitudes to redistribute the extremes, alleviating quantization errors for both activations and weights. SSC extends this approach by dynamically adjusting the balanced salience to capture the temporal variations in activation. Additionally, to eliminate extra computational costs caused by PTQ4DiT during inference, we design an offline re-parameterization strategy for DiTs. Experiments demonstrate that our PTQ4DiT successfully quantizes DiTs to 8-bit precision (W8A8) while preserving comparable generation ability and further enables effective quantization to 4-bit weight precision (W4A8) for the first time.

PTQ4DiT: Post-training Quantization for Diffusion Transformers

TL;DR

This paper proposes PTQ4DiT, a specifically designed PTQ method for DiTs that successfully quantizes DiTs to 8-bit precision (W8A8) while preserving comparable generation ability and further enables effective quantization to 4-bit weight precision (W4A8) for the first time.

Abstract

The recent introduction of Diffusion Transformers (DiTs) has demonstrated exceptional capabilities in image generation by using a different backbone architecture, departing from traditional U-Nets and embracing the scalable nature of transformers. Despite their advanced capabilities, the wide deployment of DiTs, particularly for real-time applications, is currently hampered by considerable computational demands at the inference stage. Post-training Quantization (PTQ) has emerged as a fast and data-efficient solution that can significantly reduce computation and memory footprint by using low-bit weights and activations. However, its applicability to DiTs has not yet been explored and faces non-trivial difficulties due to the unique design of DiTs. In this paper, we propose PTQ4DiT, a specifically designed PTQ method for DiTs. We discover two primary quantization challenges inherent in DiTs, notably the presence of salient channels with extreme magnitudes and the temporal variability in distributions of salient activation over multiple timesteps. To tackle these challenges, we propose Channel-wise Salience Balancing (CSB) and Spearmen's -guided Salience Calibration (SSC). CSB leverages the complementarity property of channel magnitudes to redistribute the extremes, alleviating quantization errors for both activations and weights. SSC extends this approach by dynamically adjusting the balanced salience to capture the temporal variations in activation. Additionally, to eliminate extra computational costs caused by PTQ4DiT during inference, we design an offline re-parameterization strategy for DiTs. Experiments demonstrate that our PTQ4DiT successfully quantizes DiTs to 8-bit precision (W8A8) while preserving comparable generation ability and further enables effective quantization to 4-bit weight precision (W4A8) for the first time.
Paper Structure (20 sections, 21 equations, 9 figures, 3 tables, 1 algorithm)

This paper contains 20 sections, 21 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: (Left) Illustration of salient channels in activation and weight. Note that salient activation channels exhibit variations over different timesteps (e.g., $t = t_1, t_2, t_3.$), posing non-trivial quantization challenges. To mitigate the overall quantization difficulty, our method leverages the complementarity (activation and weight channels do not have extreme magnitude simultaneously) to redistribute channel salience between weights and activations across various timesteps. (Right) Quantization performance on W8A8 and W4A8, employing FID (lower is better) and IS (higher is better) metrics on ImageNet 256$\times$256 russakovsky2015imagenet. The circle size indicates the model size.
  • Figure 2: (Left) Overview of the Diffusion Transformer (DiT) Block peebles2023scalable. (Middle) Illustration of the linear layer in Multi-Head Self-Attention (MHSA) and Pointwise Feedforward (PF) modules, which incorporates our proposed Channel-wise Salience Balancing (CSB) and Spearman's $\rho$-guided Salience Calibration (SSC) to address quantization difficulties for both activation $\mathbf{X}$ and weight $\mathbf{W}$. Appendix \ref{['appendix: structure']} depicts detailed structures of the MHSA and PF modules with adjusted linear layers. (Right) Illustration of CSB and SSC in PTQ4DiT. CSB redistributes salient channels between weights and activations from various timesteps to reduce overall quantization errors. SSC calibrates the activation salience across multiple timesteps via selective aggregation, with more focus on timesteps where quantization errors can be significantly reduced by CSB.
  • Figure 3: Illustration of maximal absolute magnitudes of activation (left) and weight (right) channels in a DiT linear layer, alongside their corresponding quantization Error (MSE). Channels with greater maximal absolute values tend to incur larger errors, presenting a fundamental quantization difficulty.
  • Figure 4: Boxplot of maximal absolute magnitudes of activation channels in a linear layer within DiT over different timesteps, which exhibit significant temporal variations.
  • Figure 5: Random samples generated by PTQ4DiT and two strong baselines: RepQ* li2023repq and Q-Diffusion qdiffusion, with W4A8 quantization on ImageNet 512$\times$512 and 256$\times$256. Our method can produce high-quality images with finer details. Appendix \ref{['appendix: visualization']} presents more visualization results.
  • ...and 4 more figures