MixDiT: Accelerating Image Diffusion Transformer Inference with Mixed-Precision MX Quantization

Daeun Kim, Jinwoo Hwang, Changhun Oh, Jongse Park

TL;DR

Diffusion Transformer (DiT) models generate high-quality images but are slow at inference because of iterative denoising and heavy GEMM workloads. MixDiT introduces a magnitude-based mixed-precision MX quantization scheme that assigns the higher-precision MX9 format to activation outliers and MX6 to the remaining values, paired with a precision-flexible MX accelerator. Key contributions include channel- and head-wise mixed-precision quantization controlled by hyperparameters that are set offline, the offline optimization procedure that selects them, and a systolic-array-based accelerator with an MX converter and a reordering controller. Empirical results show that MixDiT achieves a 2.10×–5.32× speedup over an RTX 3090 GPU with no loss in FID across multiple diffusion transformer models. The work offers a practical path toward efficient diffusion-based image synthesis by tightly integrating algorithmic quantization strategies with specialized hardware.

Abstract

Diffusion Transformer (DiT) has driven significant progress in image generation tasks. However, DiT inferencing is notoriously compute-intensive and incurs long latency even on datacenter-scale GPUs, primarily due to its iterative nature and heavy reliance on GEMM operations inherent to its encoder-based structure. To address this challenge, prior work has explored quantization, but achieving low-precision quantization for DiT inferencing with both high accuracy and substantial speedup remains an open problem. To this end, this paper proposes MixDiT, an algorithm-hardware co-designed acceleration solution that exploits mixed Microscaling (MX) formats to quantize DiT activation values. MixDiT quantizes the DiT activation tensors by selectively applying higher precision to magnitude-based outliers, which produces mixed-precision GEMM operations. To achieve tangible speedup from the mixed-precision arithmetic, we design a MixDiT accelerator that enables precision-flexible multiplications and efficient MX precision conversions. Our experimental results show that MixDiT delivers a speedup of 2.10-5.32 times over RTX 3090, with no loss in FID.
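
To make the idea of magnitude-based mixed precision concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a simplified block format with one shared power-of-two exponent per group of 16 values and omits the micro-exponents that real MX formats carry. The mantissa widths (7 bits standing in for MX9, 4 bits for MX6), the channel reordering step, and the names `bfp_quantize` and `mixed_precision_quantize` are all illustrative assumptions.

```python
import numpy as np

GROUP_SIZE = 16  # group size mentioned in the Figure 2 caption


def bfp_quantize(x, mantissa_bits, group_size=GROUP_SIZE):
    """Quantize a 1-D array using one shared power-of-two scale per group.

    Simplified stand-in for an MX format (assumption: no micro-exponents).
    """
    x = np.asarray(x, dtype=np.float64)
    if x.size == 0:
        return x.copy()
    pad = (-x.size) % group_size
    groups = np.pad(x, (0, pad)).reshape(-1, group_size)

    max_abs = np.max(np.abs(groups), axis=1, keepdims=True)
    safe = np.where(max_abs > 0, max_abs, 1.0)  # avoid log2(0)
    # Shared exponent chosen so that every value in the group is < 2**shared_exp.
    shared_exp = np.floor(np.log2(safe)) + 1
    step = 2.0 ** (shared_exp - mantissa_bits)
    limit = 2 ** mantissa_bits - 1
    q = np.clip(np.round(groups / step), -limit, limit)
    return (q * step).reshape(-1)[: x.size]


def mixed_precision_quantize(act, n_hi_channels, hi_bits=7, lo_bits=4):
    """Quantize a (tokens, channels) matrix with per-channel precision.

    Channels are reordered by descending peak magnitude so that the
    n_hi_channels largest ones form contiguous high-precision groups
    (an assumed arrangement for illustration only).
    """
    act = np.asarray(act, dtype=np.float64)
    peak = np.max(np.abs(act), axis=0)
    order = np.argsort(-peak, kind="stable")
    reordered = act[:, order]

    out = np.empty_like(reordered)
    for t in range(reordered.shape[0]):
        out[t, :n_hi_channels] = bfp_quantize(reordered[t, :n_hi_channels], hi_bits)
        out[t, n_hi_channels:] = bfp_quantize(reordered[t, n_hi_channels:], lo_bits)

    restored = np.empty_like(out)
    restored[:, order] = out  # undo the reordering
    return restored, order[:n_hi_channels].tolist()


# Toy activation: small values everywhere except two outlier channels.
act = np.full((2, 32), 0.1)
act[:, 5] = 10.0
act[:, 20] = -8.0

mixed, hi_channels = mixed_precision_quantize(act, n_hi_channels=2)
uniform_lo = np.stack([bfp_quantize(row, 4) for row in act])

print("high-precision channels:", hi_channels)
print("max error, uniform low precision:", np.max(np.abs(act - uniform_lo)))
print("max error, mixed precision:", np.max(np.abs(act - mixed)))
```

In this toy case, separating the two outlier channels keeps them from inflating the shared exponent of the low-precision groups, which is why the mixed scheme loses less information than uniform low precision. How the paper actually selects outliers and sets per-channel and per-head precision is determined by its offline optimization, which this sketch does not model.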

Paper Structure

This paper contains 12 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Denoising process of image generation diffusion with total timestep $T$ and architecture of diffusion transformer.
  • Figure 2: Block floating point and microscaling (MX) formats. While our design uses a group size of 16, the illustration depicts a group size of 4 for clarity.
  • Figure 3: Magnitude distributions of weights and activations in DiT-XL-256.
  • Figure 4: Impact of large-magnitude values on MX quantization degradation.
  • Figure 5: Observation of the value magnitudes of linear-layer activations in DiT-XL-512. The colors have the same meaning as in Figure 3.
  • ...and 6 more figures