TQ-DiT: Efficient Time-Aware Quantization for Diffusion Transformers
Younghye Hwang, Hyojin Lee, Joonhyuk Kang
TL;DR
This work tackles the high computational cost of diffusion transformers by proposing TQ-DiT, a time-aware, region-sensitive post-training quantization framework. It has three components: time-grouping quantization (TGQ), which handles activation distributions that shift across timesteps; Hessian-guided optimization (HO), which weights quantization-induced perturbations using gradient information; and multi-region quantization (MRQ), which better quantizes nonuniform post-activation distributions. On ImageNet with DiT-XL/2, TQ-DiT achieves near full-precision performance at 8-bit (W8A8) and outperforms baselines at 6-bit (W6A6), while significantly reducing calibration memory and time. The approach shows practical potential for efficient, real-time diffusion-based generation, and the framework may be adaptable to temporal architectures beyond diffusion transformers.
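The time-grouping idea can be illustrated with a minimal sketch. It is not the authors' implementation: the equal-width contiguous grouping, the min-max calibration rule, and the names `group_index`, `calibrate_group_params`, and `fake_quantize` are all assumptions made for illustration. The point is only that each group of timesteps gets its own activation scale and zero point, so a narrow activation range at one end of the schedule is not quantized with a step size calibrated for a wide range at the other end.

```python
def group_index(t, num_timesteps, num_groups):
    """Map timestep t in [0, num_timesteps) to a contiguous group (assumed equal-width)."""
    return min(t * num_groups // num_timesteps, num_groups - 1)


def calibrate_group_params(samples, num_timesteps, num_groups, bits=8):
    """Compute one (scale, zero_point) pair per timestep group.

    samples: iterable of (timestep, list_of_activation_values).
    Uses simple min-max calibration, which is an assumption of this sketch.
    """
    lo = [float("inf")] * num_groups
    hi = [float("-inf")] * num_groups
    for t, values in samples:
        g = group_index(t, num_timesteps, num_groups)
        lo[g] = min(lo[g], min(values))
        hi[g] = max(hi[g], max(values))

    levels = 2 ** bits - 1
    params = []
    for g in range(num_groups):
        if lo[g] > hi[g]:  # no calibration data fell into this group
            params.append(None)
            continue
        scale = (hi[g] - lo[g]) / levels or 1.0  # guard against a zero range
        zero_point = round(-lo[g] / scale)
        params.append((scale, zero_point))
    return params


def fake_quantize(x, scale, zero_point, bits=8):
    """Quantize x to an unsigned integer grid, then map it back to a float."""
    q = round(x / scale) + zero_point
    q = max(0, min(2 ** bits - 1, q))
    return (q - zero_point) * scale


# Toy calibration data: small activations at t=0, large activations at t=999.
samples = [(0, [-0.1, 0.1]), (999, [-4.0, 4.0])]
params = calibrate_group_params(samples, num_timesteps=1000, num_groups=2)

x = 0.05  # a value typical of the first group
err_own_group = abs(fake_quantize(x, *params[0]) - x)
err_other_group = abs(fake_quantize(x, *params[1]) - x)
```

In this toy setup the reconstruction error for `x` is smaller with the parameters of its own group than with the parameters calibrated on the other group, which is the effect per-group calibration is meant to capture.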
Abstract
Diffusion transformers (DiTs) combine transformer architectures with diffusion models. However, their computational complexity limits their use in real-time applications and raises concerns about the sustainability of AI systems. In this study, we aim to improve the computational efficiency of DiTs through model quantization, which represents weights and activation values at lower precision. Multi-region quantization (MRQ) is introduced to address the asymmetric distribution of network values in DiT blocks by allocating two scaling parameters to sub-regions. Additionally, time-grouping quantization (TGQ) is proposed to reduce the quantization error caused by temporal variation in activations. Experimental results show that the proposed algorithm achieves performance comparable to the original full-precision model, with only a 0.29 increase in FID at W8A8. Furthermore, it outperforms other baselines at W6A6, confirming its suitability for low-bit quantization. These results highlight the potential of our method to enable efficient real-time generative models.
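The idea of allocating two scaling parameters to sub-regions can be sketched as follows. This is a hedged illustration, not the paper's algorithm: splitting at zero, spending one bit on the region flag, and the value range used below (roughly that of a GELU output, with a short negative tail and a long positive one) are assumptions, as are the function names `two_region_quantize` and `single_scale_quantize`.

```python
def two_region_quantize(x, scale_neg, scale_pos, bits=8):
    """Fake-quantize x with separate step sizes for negative and non-negative values.

    One bit selects the region; the remaining bits - 1 bits index levels inside it.
    """
    levels = 2 ** (bits - 1) - 1
    if x < 0:
        q = max(-levels, min(0, round(x / scale_neg)))
        return q * scale_neg
    q = max(0, min(levels, round(x / scale_pos)))
    return q * scale_pos


def single_scale_quantize(x, lo, hi, bits=8):
    """Baseline: uniform fake quantization with one scale over the range [lo, hi]."""
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    q = max(0, min(levels, round((x - lo) / scale)))
    return lo + q * scale


# Assumed asymmetric range: short negative tail, long positive tail.
lo, hi = -0.17, 10.0
levels_per_region = 2 ** 7 - 1
scale_neg = -lo / levels_per_region  # fine steps for the narrow negative region
scale_pos = hi / levels_per_region   # coarse steps for the wide positive region

x = -0.1
err_two_region = abs(two_region_quantize(x, scale_neg, scale_pos) - x)
err_single = abs(single_scale_quantize(x, lo, hi) - x)
```

With a single scale, the step size is dictated by the wide positive range, so values in the narrow negative region land on only a handful of levels. Giving that region its own scale reduces the error for `x` in this example, at the cost of a coarser grid on the positive side.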
