Low-Bitwidth Floating Point Quantization for Efficient High-Quality Diffusion Models

Cheng Chen; Christina Giannoula; Andreas Moshovos

TL;DR

This work tackles the high computational and memory demands of diffusion models by proposing low-bitwidth floating-point post-training quantization (FP8/FP4) with gradient-based rounding. The authors demonstrate that FP8 quantization can preserve near full-precision image quality, while FP4 quantization becomes viable with a learned rounding mechanism, outperforming state-of-the-art integer quantization at the same bitwidth. A key contribution is per-tensor encoding/bias selection and a calibration-driven rounding learner, enabling effective FP quantization across unconditional and text-to-image generation, including on large models like SDXL. The results, complemented by a revised evaluation protocol, show that floating-point quantization not only reduces memory and compute costs but also yields better perceptual quality, partly due to markedly increased sparsity that offers additional acceleration opportunities.
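The per-tensor encoding/bias selection mentioned above can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation: it assumes a sign/exponent/mantissa format with subnormals and no codes reserved for infinity or NaN, an integer exponent bias, and mean squared error as the selection criterion. The names `fp_quantize` and `select_format` are hypothetical.

```python
import numpy as np

def fp_quantize(x, exp_bits, man_bits, bias):
    """Round x to the nearest value of a simulated low-bitwidth FP format.

    Assumption: 1 sign bit, subnormals supported, no inf/NaN codes reserved.
    """
    x = np.asarray(x, dtype=np.float64)
    max_exp = 2**exp_bits - 1 - bias                  # largest unbiased exponent
    max_val = (2 - 2.0**-man_bits) * 2.0**max_exp     # largest magnitude
    mag = np.minimum(np.abs(x), max_val)              # saturate out-of-range values
    _, ex = np.frexp(mag)                             # mag = m * 2**ex, m in [0.5, 1)
    # Exponent of each value; small values share the subnormal exponent 1 - bias.
    e = np.clip(ex - 1, 1 - bias, max_exp)
    step = 2.0 ** (e - man_bits)                      # grid spacing in that binade
    q = np.minimum(np.round(mag / step) * step, max_val)
    return np.sign(x) * q

def select_format(w, total_bits=8, biases=range(0, 16)):
    """Pick (exp_bits, man_bits, bias) for one tensor by minimum MSE.

    Assumption: exhaustive search over exponent/mantissa splits and integer biases.
    """
    best = None
    for exp_bits in range(1, total_bits - 1):
        man_bits = total_bits - 1 - exp_bits          # one bit is the sign
        for bias in biases:
            err = float(np.mean((w - fp_quantize(w, exp_bits, man_bits, bias)) ** 2))
            if best is None or err < best[0]:
                best = (err, exp_bits, man_bits, bias)
    return best

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=4096)                 # stand-in for one weight tensor
print(select_format(w, total_bits=8))
print(select_format(w, total_bits=4))
```

A larger bias shifts the representable range toward small magnitudes, which is why a tensor of small weights may prefer a different bias than a tensor of large activations at the same bitwidth.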

Abstract

Diffusion models are emerging models that generate images by iteratively denoising random Gaussian noise using deep neural networks. These models typically exhibit high computational and memory demands, necessitating effective post-training quantization for high-performance inference. Recent works propose low-bitwidth (e.g., 8-bit or 4-bit) quantization for diffusion models; however, 4-bit integer quantization typically results in low-quality images. We observe that on several widely used hardware platforms, there is little or no difference in compute capability between floating-point and integer arithmetic operations of the same bitwidth (e.g., 8-bit or 4-bit). Therefore, we propose an effective floating-point quantization method for diffusion models that provides better image quality than integer quantization methods. We adopt a floating-point quantization method that has proven effective for other tasks, specifically computer vision and natural language tasks, and tailor it to diffusion models by integrating weight rounding learning into the step that maps full-precision values to quantized values. We comprehensively study integer and floating-point quantization methods in state-of-the-art diffusion models. Our floating-point quantization method not only generates higher-quality images than integer quantization methods, but also shows no noticeable degradation compared to full-precision models (32-bit floating-point) when both weights and activations are quantized to 8-bit floating-point values, and shows only minimal degradation with 4-bit weights and 8-bit activations.
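To make the idea of weight rounding learning concrete, the sketch below learns, for each weight of a single linear layer, whether to round down or up on a 4-bit floating-point grid so that the layer output on calibration data stays close to the full-precision output. It is an illustration under stated assumptions, not the paper's method: the relaxation follows the general AdaRound-style recipe (a sigmoid rounding variable plus an annealed regularizer that pushes it to 0 or 1), the format is assumed to be 1 sign, 2 exponent, and 1 mantissa bit with bias 1, and the names `fp_neighbors` and `learn_rounding` and all hyperparameters are hypothetical.

```python
import numpy as np

def fp_neighbors(w, exp_bits=2, man_bits=1, bias=1):
    """Grid magnitudes just below and just above |w| (assumed FP4-style format)."""
    max_exp = 2**exp_bits - 1 - bias
    max_val = (2 - 2.0**-man_bits) * 2.0**max_exp
    mag = np.minimum(np.abs(w), max_val)
    _, ex = np.frexp(mag)                          # exact binary exponent
    e = np.clip(ex - 1, 1 - bias, max_exp)
    step = 2.0 ** (e - man_bits)
    lo = np.floor(mag / step) * step
    hi = np.minimum(lo + step, max_val)
    return lo, hi

def learn_rounding(W, X, iters=500, lr=0.05, lam=0.01, **fmt):
    """Learn per-weight round-down/round-up decisions by gradient descent."""
    sign = np.sign(W)
    lo, hi = fp_neighbors(W, **fmt)
    gap = hi - lo
    Y = X @ W.T                                    # full-precision layer output
    # Start the soft weights at (approximately) the original weights.
    h0 = np.clip((np.abs(W) - lo) / np.where(gap > 0, gap, 1.0), 0.01, 0.99)
    V = np.log(h0 / (1 - h0))
    for t in range(iters):
        h = 1 / (1 + np.exp(-V))                   # soft rounding decision in (0, 1)
        W_soft = sign * (lo + h * gap)
        R = X @ W_soft.T - Y                       # output reconstruction residual
        grad_W = 2 * R.T @ X / len(X)              # d(mean squared error)/dW_soft
        beta = 20 - 18 * t / iters                 # anneal the regularizer exponent
        u = 2 * h - 1
        grad_reg = -2 * beta * np.sign(u) * np.abs(u) ** (beta - 1)
        grad_h = grad_W * sign * gap + lam * grad_reg
        V -= lr * grad_h * h * (1 - h)             # chain rule through the sigmoid
    return sign * np.where(V > 0, hi, lo)          # harden to a grid value

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))                        # stand-in layer weights
X = rng.normal(size=(64, 8))                       # stand-in calibration inputs
W_learned = learn_rounding(W, X)
lo, hi = fp_neighbors(W)
W_nearest = np.sign(W) * np.where(np.abs(W) - lo <= hi - np.abs(W), lo, hi)
for name, Wq in [("nearest", W_nearest), ("learned", W_learned)]:
    print(name, float(np.mean((X @ Wq.T - X @ W.T) ** 2)))
```

The printed errors compare round-to-nearest against the learned decisions on this toy layer; whether learning helps, and by how much, depends on the data and hyperparameters, so no particular outcome is asserted here.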

Paper Structure (19 sections, 12 equations, 11 figures, 5 tables, 1 algorithm)

Introduction
Diffusion Model Basics
Characterization of Compute and Memory Requirements
Post-Training Quantization (PTQ)
Uniform Integer Quantization
Floating Point Quantization
Our Floating-Point Quantization Method
Encoding and Bias Value Selection
Gradient-Based Rounding Learning for Low-Bitwidth Weights
Experiments and Results
Methodology
Image Generation Quality Metrics
Facilitating Fair Comparisons Across Runs
Unconditional Image Generation
Text-to-Image Generation
...and 4 more sections

Figures (11)

Figure 1: Stable Diffusion Architecture.
Figure 2: Diffusion model forward process: it converts an image into Gaussian random noise.
Figure 3: Diffusion model backward process: it generates an image by iteratively denoising from a Gaussian random noise.
Figure 4: Breakdown of inference latency for different types of layers, when running Stable Diffusion with batch size 1 and 8 on a CPU and a GPU.
Figure 5: Inference memory requirements
...and 6 more figures
