Post-Training Quantization via Residual Truncation and Zero Suppression for Diffusion Models

Donghoon Kim, Dongyoung Lee, Ik Joon Chang, Sung-Ho Bae

TL;DR

Diffusion models demand high compute for high-quality image generation, hindering deployment at scale. The paper introduces QuaRTZ, a two-stage 4-bit post-training quantization scheme that first applies 8-bit min-max quantization to capture outliers and then uses leading-zero suppression (LZS) to compress to 4 bits while preserving the least significant bits (LSBs) that carry low-magnitude detail. It provides a theoretical distortion bound and demonstrates empirically that QuaRTZ outperforms naive 4-bit quantization across multiple diffusion architectures and datasets, including a Fréchet Inception Distance (FID) of $6.98$ on FLUX.1-schnell. The approach achieves substantial memory savings (up to $3.8\times$ reduction) without auxiliary FP16 branches, enabling more practical deployment on modern accelerators, and shows promise for application to LLMs as well.

Abstract

Diffusion models achieve high-quality image generation but face deployment challenges due to their high computational requirements. Although 8-bit outlier-aware post-training quantization (PTQ) matches full-precision performance, extending PTQ to 4 bits remains challenging. Larger step sizes in 4-bit quantization amplify rounding errors in dense, low-magnitude activations, leading to the loss of fine-grained textures. We hypothesize that not only outliers but also small activations are critical for texture fidelity. To this end, we propose Quantization via Residual Truncation and Zero Suppression (QuaRTZ), a 4-bit PTQ scheme for diffusion models. QuaRTZ applies 8-bit min-max quantization for outlier handling and compresses to 4 bits via leading-zero suppression to retain LSBs, thereby preserving texture details. Our approach reduces rounding errors and improves quantization efficiency by balancing outlier preservation and LSB precision. Both theoretical derivations and empirical evaluations demonstrate the generalizability of QuaRTZ across diverse activation distributions. Notably, 4-bit QuaRTZ achieves an FID of 6.98 on FLUX.1-schnell, outperforming SVDQuant, which requires auxiliary FP16 branches.
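The two-stage idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the symmetric scaling in Stage 1, the subgroup size of 16, and the use of plain truncation of the low bits are all assumptions made for illustration. The only details taken from the page are the 8-bit min-max first stage, the per-subgroup FLAG, and the five FLAG levels (F0 to F4), which here correspond to right-shifts of 0 to 4 bits so that a sign plus 3 magnitude bits survive. The input length is assumed to be divisible by the subgroup size.

```python
import numpy as np

def quantize_int8(x):
    """Stage 1 (assumed symmetric): min-max quantization to signed 8-bit integers."""
    scale = max(float(np.max(np.abs(x))), 1e-12) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int32)
    return q, scale

def lzs_compress(q8, group_size=16):
    """Stage 2: per-subgroup leading-zero suppression to sign + 3 magnitude bits.

    Returns 4-bit codes in [-7, 7] and one FLAG (shift amount, 0..4) per subgroup.
    """
    groups = q8.reshape(-1, group_size)
    max_mag = np.abs(groups).max(axis=1)
    # FLAG = number of low bits to drop so the largest magnitude fits in 3 bits.
    flag = np.digitize(max_mag, [8, 16, 32, 64])[:, None]
    # Truncate the low bits of the magnitude; subgroups with FLAG 0 are untouched.
    codes = np.sign(groups) * (np.abs(groups) >> flag)
    return codes, flag

def lzs_decompress(codes, flag, shape):
    """Undo the shift; the truncated low bits are reconstructed as zeros."""
    return (np.sign(codes) * (np.abs(codes) << flag)).reshape(shape)

# Hypothetical activations: dense small values plus a few large outliers.
rng = np.random.default_rng(0)
x = rng.laplace(scale=0.05, size=1024)
x[::128] = 4.0

q8, scale = quantize_int8(x)
codes, flag = lzs_compress(q8)
x_hat = lzs_decompress(codes, flag, q8.shape) * scale
print("mean abs error:", np.abs(x - x_hat).mean())
```

In this sketch, a subgroup whose 8-bit values all have magnitude below 8 receives FLAG 0 and loses nothing in Stage 2, while a subgroup containing an outlier gives up low bits to keep the outlier's magnitude. The storage cost is 4 bits per value plus one small FLAG per subgroup.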

Paper Structure

This paper contains 30 sections, 1 theorem, 15 equations, 15 figures, and 7 tables.

Key Result

Theorem 1

Let $X \in \mathbb{R}$ with density $p(x)$. Denote the quantization error of direct 4-bit uniform quantization as $E_q^4$, and the error of 8-bit quantization followed by LZS compression as $E_{\text{total}}$. If less than half of the probability mass lies in high-index bins ($|j| \ge 8$), then

Figures (15)

  • Figure 1: Qualitative comparison on PixArt-$\Sigma$ using different quantization settings and our method.
  • Figure 2: Illustration of the proposed two-stage quantization. Stage 1 applies 8-bit integer quantization to capture outliers with small step size, and Stage 2 compresses activations to 4 bits via subgroup-based leading-zero suppression. The green color indicates entropy (from low to high across values), while the blue block represents the FLAG bits assigned per subgroup.
  • Figure 3: Compared to naïve INT4 quantization, QuaRTZ avoids severe rounding errors in dense low-magnitude regions. The histogram is partitioned into FLAG regions (F0–F4): F0 denotes the preserved fine-grained region around zero, while F1–F4 correspond to progressively larger magnitude ranges captured via FLAG-based shifts. Despite compression, the magnitude of outliers is retained similarly to INT4 quantization.
  • Figure 4: Entropy analysis demonstrates that our method exhibits higher entropy at every layer compared to naïve INT4.
  • Figure 5: Images generated by different quantization methods with the FLUX.1-schnell model on the MJHQ dataset (top) and the DCI dataset (bottom).
  • ...and 10 more figures

Theorems & Definitions (2)

  • Theorem 1: Error Bound for QuaRTZ
  • Proof: Sketch of Proof