FP4DiT: Towards Effective Floating Point Quantization for Diffusion Transformers

Ruichen Chen, Keith G. Mills, Di Niu

TL;DR

FP4DiT tackles the practical deployment of diffusion transformers by applying floating-point post-training quantization (FPQ) to DiTs, including PixArt and Hunyuan, achieving W4A6 while outperforming INT PTQ baselines on CLIP, ImageReward and HPSv2. The method combines optimized FP formats within DiT blocks, scale-aware AdaRound for FP weight calibration, and token-wise online activation quantization to handle patch-level activation dynamics. Empirical results on MS-COCO and HPSv2 across multiple DiT backbones demonstrate superior quantitative and human-preference performance with minimal hardware overhead. This work suggests FPQ as a promising direction for efficient, high-quality diffusion-based image synthesis on edge devices.

Abstract

Diffusion Models (DMs) have revolutionized text-to-image generation. However, the large computational cost and model footprint of DMs hinder practical deployment, especially on edge devices. Post-training quantization (PTQ) is a lightweight method to alleviate these burdens without the need for training or fine-tuning. While recent DM PTQ methods achieve W4A8 (i.e., 4-bit weights and 8-bit activations) with integer-based PTQ, two key limitations remain. First, most existing DM PTQ methods evaluate on classical DMs like Stable Diffusion XL, 1.5 or earlier, which use convolutional U-Nets, whereas newer Diffusion Transformer (DiT) models like the PixArt series, Hunyuan and others adopt fundamentally different transformer backbones to achieve superior image synthesis. Second, integer (INT) quantization prevails in DM PTQ but does not align well with the network weight and activation distributions, while Floating-Point Quantization (FPQ) remains under-investigated despite its potential to better match those distributions in low-bit settings for DiT. In this paper, we introduce FP4DiT, a PTQ method that leverages FPQ to achieve W4A6 quantization. Specifically, we extend and generalize the Adaptive Rounding PTQ technique to adequately calibrate weight quantization for FPQ, and we demonstrate that DiT activations depend on input patch data, necessitating robust online activation quantization techniques. Experimental results demonstrate that FP4DiT achieves higher CLIP, ImageReward and HPSv2 performance than integer-based PTQ at the W4A6 and W4A8 precision levels while generating convincing visual content on PixArt-$\alpha$, PixArt-$\Sigma$ and Hunyuan.
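To make the idea of online, token-wise activation quantization concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the scale rule (per-token absolute maximum divided by the largest grid value) is an assumption, and a small E2M1 grid is used purely for brevity even though the paper targets 6-bit activations. It contrasts one scale per token with a single scale for the whole tensor.

```python
# Illustrative grid only (non-negative E2M1 magnitudes); the paper's
# activation format is 6-bit and is not reproduced here.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_row(row, grid, scale):
    """Round each entry to the nearest signed grid value times `scale`."""
    out = []
    for x in row:
        mag = min(grid, key=lambda g: abs(abs(x) / scale - g))
        out.append(scale * mag if x >= 0 else -scale * mag)
    return out


def tokenwise_quantize(tokens, grid=E2M1):
    """One scale per token (row), computed at inference time from that row."""
    result = []
    for row in tokens:
        amax = max(abs(x) for x in row)
        scale = amax / grid[-1] if amax > 0 else 1.0
        result.append(quantize_row(row, grid, scale))
    return result


def tensorwise_quantize(tokens, grid=E2M1):
    """A single scale shared by every token in the tensor."""
    amax = max(abs(x) for row in tokens for x in row)
    scale = amax / grid[-1] if amax > 0 else 1.0
    return [quantize_row(row, grid, scale) for row in tokens]


def max_abs_error(a, b):
    """Largest element-wise absolute difference between two rows."""
    return max(abs(x - y) for x, y in zip(a, b))


# Two tokens with very different magnitudes, mimicking patch dependency.
tokens = [[0.01, -0.02, 0.03], [12.0, -6.0, 3.0]]
tok = tokenwise_quantize(tokens)
ten = tensorwise_quantize(tokens)
print("token-wise error on small token: ", max_abs_error(tokens[0], tok[0]))
print("tensor-wise error on small token:", max_abs_error(tokens[0], ten[0]))
```

With a shared scale, the small-magnitude token collapses to zero because the large token dictates the step size; a per-token scale avoids this at the cost of computing one maximum per token at inference time.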

Paper Structure

This paper contains 29 sections, 2 theorems, 13 equations, 15 figures, 12 tables, 1 algorithm.

Key Result

Theorem 1

Let $s$ be the quantization scale corresponding to the rounding mask $\mathbf{V}$. Then for gradient descent, given as $\mathbf{V}_{n+1} = \mathbf{V}_{n} - \alpha \nabla F(\mathbf{V}_{n})$, the subtracted gradient term $\nabla F(\mathbf{V}_{n})$ depends on the scalar $s$.
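
The dependence on $s$ can be seen from a short chain-rule sketch. This assumes the standard AdaRound parameterization (soft-quantized weights $\tilde{\mathbf{W}}$ and rectified sigmoid $h$ with stretch parameters $\zeta$, $\gamma$), omits the outer clipping to the representable range, and considers only the part of $F$ that depends on $\mathbf{V}$ through $\tilde{\mathbf{W}}$; the paper's exact $F$ may differ.

$$
\tilde{\mathbf{W}} = s \cdot \left( \left\lfloor \frac{\mathbf{W}}{s} \right\rfloor + h(\mathbf{V}) \right), \qquad h(\mathbf{V}) = \mathrm{clip}\big(\sigma(\mathbf{V})(\zeta - \gamma) + \gamma,\ 0,\ 1\big)
$$

$$
\nabla F(\mathbf{V}) = \frac{\partial F}{\partial \tilde{\mathbf{W}}} \odot \frac{\partial \tilde{\mathbf{W}}}{\partial \mathbf{V}} = s \cdot h'(\mathbf{V}) \odot \frac{\partial F}{\partial \tilde{\mathbf{W}}}
$$

Under these assumptions the update step is multiplied by $s$. With INT quantization a single $s$ applies to every weight, whereas in FP quantization the step size changes with the exponent, so weights in different exponent intervals receive differently scaled updates.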

Figures (15)

  • Figure 1: Value distributions for INT4 and three variants of FP4: E1M2, E2M1 and E3M0. Note that E0M3 is INT4. Observe how INT4 values are evenly distributed, while FP4 values cluster at the origin as the number of exponent (E) bits increases.
  • Figure 3: The GELU activation and its sensitive interval. With the same amount of discrete values, non-uniform quantization can better capture the sensitive interval.
  • Figure 4: (a) The binary gate function of INT AdaRound. All gates are identical because there is only one scale in INT quantization. (b) The binary gate functions of the original AdaRound on FP quantization. (c) The binary gate functions of scale-aware AdaRound. The red dashed line indicates the demarcation between rounding up (right) and rounding down (left). Our scale-aware AdaRound normalizes the slope near the turning point, which stabilizes the optimization and helps improve the quantization performance.
  • Figure 5: (a) Input values at different timesteps for PixArt-$\alpha$ on 128 images sampled from MS-COCO. The input does not shrink progressively across timesteps as it does in U-Net DMs. (b) The time-embedded scale for the output of the 7th DiT block's FeedForward. It is almost constant across timesteps. (c) The output of the 7th DiT block. Its range tends to remain constant but shifts as a function of time.
  • Figure 6: The distribution of the absolute maximum for each token's activation among 4096 tokens in the PixArt-$\alpha$ model. The distribution demonstrates a strong patch dependency in the DiT activation.
  • ...and 10 more figures
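
The FP4 variants named in the Figure 1 caption can be enumerated with a short Python sketch. The conventions here are assumptions rather than details confirmed by the paper: a sign-magnitude layout, an IEEE-style bias of $2^{E-1}-1$, subnormals at exponent code 0, and no codes reserved for infinity or NaN. The function names `fp_grid` and `quantize` are hypothetical.

```python
def fp_grid(exp_bits, man_bits, bias=None):
    """Non-negative values of a minifloat with exp_bits >= 1 exponent bits.

    Assumes subnormals at exponent code 0 and no inf/NaN encodings.
    """
    if bias is None:
        bias = 2 ** (exp_bits - 1) - 1  # assumed IEEE-style bias
    values = set()
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:
                # Subnormal range: no implicit leading 1.
                v = 2.0 ** (1 - bias) * frac
            else:
                # Normal range: implicit leading 1.
                v = 2.0 ** (e - bias) * (1.0 + frac)
            values.add(v)
    return sorted(values)


def quantize(x, grid, scale=1.0):
    """Round x to the nearest signed grid value, after dividing by scale."""
    mag = min(grid, key=lambda g: abs(abs(x) / scale - g))
    return scale * mag if x >= 0 else -scale * mag


e2m1 = fp_grid(2, 1)
e3m0 = fp_grid(3, 0)
print("E2M1:", e2m1)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print("E3M0:", e3m0)  # [0.0, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
print(quantize(2.4, e2m1), quantize(-5.2, e2m1))  # 2.0 -6.0
```

Relative to its largest value, E3M0 places more of its levels near zero than E2M1 does, which matches the clustering described in the caption. A different bias or subnormal convention would shift these grids.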

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Theorem 2
  • proof