Pioneering 4-Bit FP Quantization for Diffusion Models: Mixup-Sign Quantization and Timestep-Aware Fine-Tuning

Maosen Zhao, Pengtao Chen, Chong Yu, Yan Wen, Xudong Tan, Tao Chen

TL;DR

This work tackles 4-bit floating-point (FP) quantization for diffusion models, a regime where traditional INT quantization and PTQ-based fine-tuning struggle. It introduces Mixup-Sign Floating-Point Quantization (MSFP), which handles activation asymmetry by applying unsigned FP with a zero point to anomalous-activation layers while retaining signed FP for normal layers. A timestep-aware LoRA router (TALoRA) allocates multiple LoRAs across diffusion timesteps, and a denoising-factor aligned loss (DFA) scales the loss by the denoising factor $\gamma_t$ so that fine-tuning tracks the actual impact of quantization. In experiments on DDIM and LDM pipelines across multiple datasets, the approach achieves state-of-the-art 4-bit FP performance for diffusion models, with 6-bit results closely approaching full precision and clear gains over PTQ baselines, which suggests practical viability for efficient diffusion-model deployment.

Abstract

Model quantization reduces the bit-width of weights and activations, improving memory efficiency and inference speed in diffusion models. However, achieving 4-bit quantization remains challenging. Existing methods, primarily based on integer quantization and post-training quantization fine-tuning, struggle with inconsistent performance. Inspired by the success of floating-point (FP) quantization in large language models, we explore low-bit FP quantization for diffusion models and identify key challenges: the failure of signed FP quantization to handle asymmetric activation distributions, the insufficient consideration of temporal complexity in the denoising process during fine-tuning, and the misalignment between fine-tuning loss and quantization error. To address these challenges, we propose the mixup-sign floating-point quantization (MSFP) framework, first introducing unsigned FP quantization in model quantization, along with timestep-aware LoRA (TALoRA) and denoising-factor loss alignment (DFA), which ensure precise and stable fine-tuning. Extensive experiments show that we are the first to achieve superior performance in 4-bit FP quantization for diffusion models, outperforming existing PTQ fine-tuning methods in 4-bit INT quantization.
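
To make the asymmetry argument concrete, the following is a minimal sketch and not the authors' implementation. It assumes a 4-bit signed format with 1 sign, 2 exponent, and 1 mantissa bit (E2M1), and an unsigned format that spends the freed sign bit on the mantissa (E2M2). That bit split, the min-based zero point, the max-based scale, and the helper names `fp_grid` and `quantize` are all illustrative assumptions. The test data are SiLU outputs, which are bounded below near -0.278 and therefore strongly asymmetric.

```python
import numpy as np

def fp_grid(exp_bits, man_bits, signed):
    """Enumerate the values of a tiny FP format (assumes no inf/NaN codes)."""
    bias = 2 ** (exp_bits - 1) - 1
    values = []
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:  # subnormal range: no implicit leading one
                values.append(frac * 2.0 ** (1 - bias))
            else:       # normal range: implicit leading one
                values.append((1.0 + frac) * 2.0 ** (e - bias))
    pos = np.array(sorted(values))
    # Signed grids mirror the positive half; +0 and -0 collapse to one value.
    return np.concatenate([-pos[:0:-1], pos]) if signed else pos

def quantize(x, grid, zero_point=0.0):
    """Shift by zero_point, scale onto the grid, round to the nearest value."""
    shifted = x - zero_point
    scale = np.abs(shifted).max() / grid.max()  # assumes x is not constant
    idx = np.abs(shifted[:, None] / scale - grid[None, :]).argmin(axis=1)
    return grid[idx] * scale + zero_point

rng = np.random.default_rng(0)
z = rng.normal(size=10_000)
acts = z / (1.0 + np.exp(-z))  # SiLU outputs: asymmetric activations

# Signed FP: half of the codes cover negative values that barely occur.
signed = quantize(acts, fp_grid(2, 1, signed=True))
# Unsigned FP with a zero point: all 16 codes cover [min, max].
unsigned = quantize(acts, fp_grid(2, 2, signed=False), zero_point=acts.min())

mse_signed = np.mean((acts - signed) ** 2)
mse_unsigned = np.mean((acts - unsigned) ** 2)
print(f"signed E2M1 MSE:   {mse_signed:.5f}")
print(f"unsigned E2M2 MSE: {mse_unsigned:.5f}")
```

On this synthetic input the unsigned grid with a zero point yields the lower reconstruction error, because no codes are wasted on a negative range the activations never reach. On roughly symmetric activations the signed grid would be the natural choice, which is the intuition behind mixing the two schemes per layer.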

Paper Structure

This paper contains 27 sections, 10 equations, 12 figures, 11 tables, 1 algorithm.

Figures (12)

  • Figure 1: The activation distributions in NALs and AALs, results on the CelebA dataset. (a) The paradigm of NALs with symmetric activations. (b) The typical paradigm of AALs with asymmetric activations, where unsigned FP quantization is more suitable. (c) The infrequent paradigm of AALs with relatively symmetric activations, where either signed or unsigned FP quantization could be applicable.
  • Figure 2: Effect of bit-width reduction on activation representation capacity in AALs (blue) and NALs (orange) under signed FP quantization, evaluated on the CelebA dataset.
  • Figure 3: The two losses and the performance degradation between the quantized and full-precision models across steps. Compared with the metric, the original loss shows an inverse trend, while the aligned loss remains consistent.
  • Figure 4: The MSE of activations before and after quantization across all AALs under four different strategies, normalized against the baseline of signed FP quantization without zero point (purple).
  • Figure 5: The pipeline of our proposed method. UNets are applied to the Mixup-Sign Floating-Point Quantization (MSFP), where distinct floating-point quantization schemes are employed for Anomalous-Activation-Distribution Layers (AALs) and Normal-Activation-Distribution Layers (NALs). During the fine-tuning stage, multiple LoRA modules are introduced, and a timestep-aware routing mechanism is used for dynamic LoRA allocation across different timesteps. Additionally, a denoising-factor alignment technique is employed to align the loss function with quantization-induced performance degradation.
  • ...and 7 more figures