HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization

Wenxuan Liu, Sai Qian Zhang

TL;DR

Diffusion Transformers (DiTs) offer high-quality image generation but are expensive to run on resource-limited devices. The authors propose HQ-DiT, a post-training quantization method that uses 4-bit floating point (FP4) for both weights and activations. It combines Hadamard-based outlier mitigation, a data-distribution-driven choice of FP format, and an adaptation of GPTQ to FP quantization. On ImageNet generation, 4-bit FP quantization comes close to full-precision quality, with only a small sFID penalty, while delivering substantial speedups and memory savings. This allows DiTs to be deployed on resource-constrained platforms without retraining.
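To make "data-distribution-driven FP format selection" concrete, here is a minimal sketch of choosing among 4-bit floating-point layouts for one tensor. It is an illustration under stated assumptions, not the authors' procedure: the helper names (fp_grid, fake_quantize, select_format), the IEEE-style bias, the per-tensor max scaling, and the mean-squared-error criterion are all hypothetical choices. The selection rule HQ-DiT actually uses is not given in this summary.

```python
import math

def fp_grid(exp_bits, man_bits):
    """Non-negative values of a simplified sign/exponent/mantissa format.

    Assumptions (not from the paper): bias = 2**(exp_bits - 1) - 1,
    subnormals are supported, and no codes are reserved for inf/NaN.
    """
    bias = 2 ** (exp_bits - 1) - 1
    values = set()
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:  # subnormal range
                values.add(2.0 ** (1 - bias) * frac)
            else:       # normal range
                values.add(2.0 ** (e - bias) * (1.0 + frac))
    return sorted(values)

def fake_quantize(xs, grid):
    """Scale xs so its largest magnitude hits the grid maximum, then round to nearest."""
    max_abs = max(abs(x) for x in xs)
    if max_abs == 0.0:
        return list(xs)
    scale = max_abs / grid[-1]
    out = []
    for x in xs:
        mag = min(grid, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out

# The three ways to split 3 non-sign bits between exponent and mantissa.
FP4_FORMATS = {"E1M2": (1, 2), "E2M1": (2, 1), "E3M0": (3, 0)}

def select_format(xs):
    """Return the format with the lowest quantization MSE (assumed criterion)."""
    errors = {}
    for name, (e, m) in FP4_FORMATS.items():
        q = fake_quantize(xs, fp_grid(e, m))
        errors[name] = sum((a - b) ** 2 for a, b in zip(xs, q)) / len(xs)
    return min(errors, key=errors.get), errors

# Evenly spaced values favor the mantissa-heavy format in this toy case.
best, errors = select_format([-3.5, -2.0, -0.5, 0.0, 1.0, 2.5, 3.5])
print(best, errors)
```

More mantissa bits give a more uniform grid, while more exponent bits give a wider dynamic range with coarser steps near the maximum. That is why the best layout depends on how the values in a tensor are distributed.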

Abstract

Diffusion Transformers (DiTs) have recently gained substantial attention in both industrial and academic fields for their superior visual generation capabilities, outperforming traditional diffusion models that use U-Net. However, the enhanced performance of DiTs also comes with high parameter counts and implementation costs, seriously restricting their use on resource-limited devices such as mobile phones. To address these challenges, we introduce the Hybrid Floating-point Quantization for DiT (HQ-DiT), an efficient post-training quantization method that utilizes 4-bit floating-point (FP) precision on both weights and activations for DiT inference. Compared to fixed-point quantization (e.g., INT8), FP quantization, complemented by our proposed clipping range selection mechanism, naturally aligns with the data distribution within DiT, resulting in a minimal quantization error. Furthermore, HQ-DiT also implements a universal identity mathematical transform to mitigate the serious quantization error caused by the outliers. The experimental results demonstrate that DiT can achieve extremely low-precision quantization (i.e., 4 bits) with negligible impact on performance. Our approach marks the first instance where both weights and activations in DiTs are quantized to just 4 bits, with only a 0.12 increase in sFID on ImageNet.
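The "identity mathematical transform" mentioned above is described in the TL;DR as Hadamard-based. The sketch below shows why such a transform is an identity for a linear layer: for an orthogonal matrix H, (XH)(WH)^T = X H H^T W^T = X W^T, so rotating both the activations and the weights leaves the output unchanged while spreading an outlier channel across all channels. The toy values and helper names are hypothetical, and this summary does not say where HQ-DiT applies the transform within a DiT block or how it is fused into the weights.

```python
import math

def hadamard(n):
    """Normalized n x n Hadamard matrix via the Sylvester construction (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def max_abs(m):
    return max(abs(v) for row in m for v in row)

# Toy activation (2 tokens x 4 channels) with an outlier in channel 2.
x = [[0.1, -0.2, 8.0, 0.3],
     [0.2, 0.1, 7.5, -0.1]]
# Toy weight (3 output features x 4 input channels).
w = [[0.5, -0.3, 0.2, 0.1],
     [0.0, 0.4, -0.6, 0.2],
     [0.3, 0.3, 0.1, -0.5]]

h = hadamard(4)
x_rot = matmul(x, h)  # rotated activations: the outlier energy is spread out
w_rot = matmul(w, h)  # the same rotation applied to the weights

y_ref = matmul(x, transpose(w))          # original layer output X W^T
y_rot = matmul(x_rot, transpose(w_rot))  # output computed from rotated operands

print(max_abs(x), max_abs(x_rot))
print(max(abs(a - b) for ra, rb in zip(y_ref, y_rot) for a, b in zip(ra, rb)))
```

In this toy case the rotated activations have a smaller peak magnitude than the originals, so a low-bit quantizer wastes less of its range on a single channel. The two outputs agree up to floating-point rounding.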

Paper Structure

This paper contains 25 sections, 13 equations, 12 figures, 4 tables, 1 algorithm.

Figures (12)

  • Figure 1: Performance of different approaches on ImageNet $256 \times 256$. Both weights and activations are quantized with 4 bits. The x-axis denotes the runtime for each quantization approach. The size of the circle indicates the standard deviation.
  • Figure 2: A DiT block.
  • Figure 3: (a) Magnitude distribution of an input activation of a DiT linear layer before and after Hadamard transform. (b) Histogram on an input activation matrix across different time steps.
  • Figure 4: Quantization workflow in (a) SA block and (b) FFN block for DiT.
  • Figure 5: HQ-DiT quantization scheme within a DiT block. Operations in FP4 and FP32 are highlighted in blue and red, respectively.
  • ...and 7 more figures