TQ-DiT: Efficient Time-Aware Quantization for Diffusion Transformers

Younghye Hwang, Hyojin Lee, Joonhyuk Kang

TL;DR

This work tackles the high computational cost of diffusion transformers by proposing TQ-DiT, a post-training quantization framework that is time-aware and region-sensitive. It introduces time-grouping quantization (TGQ) to handle timestep-dependent activation shifts, Hessian-guided optimization (HO) to weight perturbations by gradient information, and multi-region quantization (MRQ) to better quantize nonuniform post-activation distributions. Empirically, TQ-DiT achieves near full-precision performance at 8-bit (W8A8) and outperforms baselines at 6-bit (W6A6) on ImageNet with DiT-XL-2, while significantly reducing calibration memory and time. The approach demonstrates strong practical potential for efficient, real-time diffusion-based generation and offers a framework adaptable to temporal architectures beyond diffusion transformers.

Abstract

Diffusion transformers (DiTs) combine transformer architectures with diffusion models. However, their computational complexity imposes significant limitations on real-time applications and sustainability of AI systems. In this study, we aim to enhance the computational efficiency through model quantization, which represents the weights and activation values with lower precision. Multi-region quantization (MRQ) is introduced to address the asymmetric distribution of network values in DiT blocks by allocating two scaling parameters to sub-regions. Additionally, time-grouping quantization (TGQ) is proposed to reduce quantization error caused by temporal variation in activations. The experimental results show that the proposed algorithm achieves performance comparable to the original full-precision model with only a 0.29 increase in FID at W8A8. Furthermore, it outperforms other baselines at W6A6, thereby confirming its suitability for low-bit quantization. These results highlight the potential of our method to enable efficient real-time generative models.
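The abstract describes MRQ as allocating two scaling parameters to sub-regions of a skewed distribution. The following is a minimal sketch of that idea for non-negative post-softmax values, not the authors' implementation. The function name `two_region_quantize`, the fixed `threshold` that splits the regions, and the choice to spend one bit on the region selector are all assumptions made for illustration.

```python
import numpy as np

def two_region_quantize(x, threshold, bits=8):
    """Quantize non-negative values with two step sizes (illustrative only).

    Assumption: one bit selects the region, so each region gets
    2**(bits - 1) levels. Values below `threshold` use a fine step,
    the rest use a coarse step that covers the full range.
    """
    x = np.asarray(x, dtype=np.float64)
    levels = 2 ** (bits - 1)
    s_small = threshold / (levels - 1)   # fine step for the dense region near 0
    s_large = x.max() / (levels - 1)     # coarse step for the sparse tail

    q_small = np.clip(np.round(x / s_small), 0, levels - 1) * s_small
    q_large = np.clip(np.round(x / s_large), 0, levels - 1) * s_large
    return np.where(x < threshold, q_small, q_large)

# Softmax-like input: most probabilities are tiny, a few are large.
rng = np.random.default_rng(0)
logits = 3.0 * rng.normal(size=(4, 64))
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

dequantized = two_region_quantize(probs, threshold=0.05, bits=6)
print("max abs error:", np.abs(dequantized - probs).max())
```

With this split, small values are rounded with a step of `threshold / (levels - 1)` rather than a step sized for the full range, which is the motivation the abstract gives for using separate scaling parameters.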

Paper Structure

This paper contains 18 sections, 17 equations, 6 figures, 4 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Quantization performance with weights and activations at 8-bit precision (W8A8) and 6-bit precision (W6A6). Among the compared quantization schemes, the proposed TQ-DiT comes closest to the original full-precision model, with the lowest FID and the highest IS.
  • Figure 2: Distribution of values after (a) softmax and (b) GELU in DiT blocks. Because the values are non-uniformly distributed, conventional quantization can degrade performance significantly.
  • Figure 3: Maximum channel magnitudes after softmax are depicted for various timesteps during inference, revealing large variance across timesteps. This shows the necessity of handling timestep-dependent values effectively.
  • Figure 4: Illustration of the diffusion transformer (DiT) with stacked transformer-based DiT blocks. Each block includes MHSA layers with softmax and PF layers with GELU activations, conditioned on class and timestep inputs.
  • Figure 5: Illustration of the proposed TQ-DiT. Multi-region quantization (MRQ) handles skewed distributions in post-softmax and post-GELU layers within MHSA and PF. Hessian-guided optimization (HO) with time-grouping quantization (TGQ) addresses timestep-dependent activation variability in post-softmax layers.
  • ...and 1 more figures
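
The timestep dependence noted in the Figure 3 caption is what TGQ targets. One way to picture time grouping is to calibrate a separate activation scale per group of timesteps instead of a single scale for the whole sampling trajectory. The sketch below assumes equal-width contiguous groups and max-based calibration; the source text does not specify either, and the helper names (`assign_group`, `calibrate_group_scales`, `quantize_at_timestep`) are hypothetical.

```python
import numpy as np

def assign_group(t, num_timesteps, num_groups):
    """Map timestep t to a group index (assumed: equal-width contiguous groups)."""
    return min(t * num_groups // num_timesteps, num_groups - 1)

def calibrate_group_scales(acts_by_timestep, num_groups, bits=8):
    """Return one scale per timestep group from calibration activations.

    Assumes non-negative activations (e.g. post-softmax) and
    num_groups <= number of timesteps, so no group is empty.
    """
    num_timesteps = len(acts_by_timestep)
    assert num_groups <= num_timesteps
    maxima = np.zeros(num_groups)
    for t, acts in enumerate(acts_by_timestep):
        g = assign_group(t, num_timesteps, num_groups)
        maxima[g] = max(maxima[g], float(np.abs(acts).max()))
    return maxima / (2 ** bits - 1)

def quantize_at_timestep(x, t, scales, num_timesteps, bits=8):
    """Quantize and dequantize x using the scale of the group that contains t."""
    s = scales[assign_group(t, num_timesteps, len(scales))]
    return np.clip(np.round(np.asarray(x) / s), 0, 2 ** bits - 1) * s

# Toy calibration data whose magnitude grows with the timestep index.
acts = [np.full(4, 0.1 * (t + 1)) for t in range(10)]
scales = calibrate_group_scales(acts, num_groups=2, bits=8)
print("per-group scales:", scales)
print("quantized at t=2:", quantize_at_timestep([0.25], 2, scales, num_timesteps=10))
```

In this toy setup the early group gets a smaller scale than the late group, so early-timestep activations are not quantized with a step sized for the largest late-timestep values.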