MPQ-Diff: Mixed Precision Quantization for Diffusion Models

Rocco Manz Maruzzelli, Basile Lewandowski, Lydia Y. Chen

TL;DR

MPQ-Diff is a mixed-precision quantization framework for diffusion models that allocates per-layer bit-widths using a timestep-aware network orthogonality metric (ORM). It computes ORM matrices at sampled timesteps, aggregates them with exponential weighting to obtain a per-layer importance score, and solves a linear program that maximizes aggregated orthogonality under a memory budget, so no retraining is required. The method is compatible with fixed-precision baselines. Compared with fixed-precision quantization, it lowers FID from 65.73 to 15.39 on LSUN and from 52.66 to 14.93 on ImageNet, with only modest increases in model size. Quantization in turn reduces the sampling cost and memory footprint of diffusion models, making high-quality generation more practical for real-world deployments.
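
The bit-allocation step can be illustrated with a minimal sketch. It assumes the linear program has the form "maximize the sum of importance times bit-width, subject to one memory budget and per-layer bit-width bounds"; the exact objective and constraints used by MPQ-Diff are not given in this summary. Under that assumption the continuous problem is a fractional knapsack, so a greedy pass solves it without an LP solver. The names `allocate_bits`, `theta`, `params`, and `budget_bits` are hypothetical.

```python
def allocate_bits(theta, params, budget_bits, b_min=2, b_max=8):
    """Greedy solution of an assumed LP (not necessarily the paper's exact one):

        maximize   sum(theta[i] * b[i])
        subject to sum(params[i] * b[i]) <= budget_bits
                   b_min <= b[i] <= b_max

    theta:  per-layer importance scores (assumed non-negative)
    params: per-layer parameter counts
    Returns continuous bit-widths; floor them to get an integer configuration.
    """
    n = len(theta)
    bits = [float(b_min)] * n
    # Every layer must receive at least b_min bits.
    remaining = budget_bits - b_min * sum(params)
    if remaining < 0:
        raise ValueError("budget cannot cover b_min bits for every layer")

    # Spend the rest on layers with the highest importance per parameter.
    order = sorted(range(n), key=lambda i: theta[i] / params[i], reverse=True)
    for i in order:
        if theta[i] <= 0 or remaining <= 0:
            break
        extra = min(b_max - b_min, remaining / params[i])
        bits[i] += extra
        remaining -= extra * params[i]
    return bits


if __name__ == "__main__":
    theta = [0.9, 0.1, 0.5]    # hypothetical importance scores
    params = [100, 400, 200]   # hypothetical parameter counts
    bits = allocate_bits(theta, params, budget_bits=3000)
    print([int(b) for b in bits])  # prints [8, 2, 7]
```

In this toy case the small but important first layer reaches the maximum width, while the large, unimportant second layer stays at the minimum. Flooring fractional results keeps the configuration within the budget.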

Abstract

Diffusion models (DMs) generate remarkably high-quality images via a stochastic denoising process, which unfortunately incurs high sampling time. Post-training quantization of diffusion models to fixed bit-widths, e.g., 4 bits for weights and 8 bits for activations, has been shown to be effective in reducing sampling time while maintaining image quality. Motivated by the observation that the cross-layer dependency of DMs varies across layers and sampling steps, we propose a mixed-precision quantization scheme, MPQ-Diff, which allocates different bit-widths to the weights and activations of the layers. We advocate using the cross-layer correlation of a given layer, termed the network orthogonality metric, as a proxy for the relative importance of that layer at each sampling step. We further adopt a uniform sampling scheme to avoid the excessive profiling overhead of estimating orthogonality across all time steps. We evaluate the proposed mixed-precision scheme on LSUN and ImageNet, showing a significant improvement in FID from 65.73 to 15.39 and from 52.66 to 14.93, respectively, compared to fixed-precision quantization.
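
The profiling idea in the abstract can be sketched as follows: pick a few evenly spaced timesteps instead of all of them, then combine the per-timestep layer scores into one importance value per layer. This is an illustration under stated assumptions, not the authors' implementation. The function names, the `decay` parameter, and the direction of the exponential weighting (earlier profiled steps weighted more) are all hypothetical; the per-timestep scores are taken as given inputs.

```python
import math


def uniform_timesteps(total_steps, num_samples):
    """Return num_samples evenly spaced timestep indices from range(total_steps)."""
    if not 1 <= num_samples <= total_steps:
        raise ValueError("num_samples must be between 1 and total_steps")
    stride = total_steps / num_samples
    return [int(k * stride) for k in range(num_samples)]


def aggregate_importance(per_step_scores, decay=0.5):
    """Exponentially weighted average of per-layer scores over profiled steps.

    per_step_scores[k][i] is the score of layer i at the k-th profiled timestep.
    Step k gets weight exp(-decay * k); the rate and direction of the decay
    are assumptions made for this sketch. decay=0 reduces to a plain mean.
    """
    weights = [math.exp(-decay * k) for k in range(len(per_step_scores))]
    total = sum(weights)
    n_layers = len(per_step_scores[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, per_step_scores)) / total
        for i in range(n_layers)
    ]


if __name__ == "__main__":
    steps = uniform_timesteps(total_steps=100, num_samples=5)
    print(steps)  # prints [0, 20, 40, 60, 80]
    # Hypothetical scores for two layers at two profiled timesteps.
    print(aggregate_importance([[1.0, 3.0], [3.0, 5.0]], decay=0.0))  # prints [2.0, 4.0]
```

Profiling 5 of 100 timesteps cuts the number of orthogonality estimates by a factor of 20 in this example; how many samples are enough in practice is a question the full paper would need to answer.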

Paper Structure

This paper contains 15 sections, 9 equations, 16 figures, 2 tables, 1 algorithm.

Figures (16)

  • Figure 1: Samples of generated images under different bit precision.
  • Figure 2: Overview of the MPQ-Diff workflow. a) Deconstruct the DM into a set of functions $\mathcal{F}$, which are used across all $T$ generation timesteps. b) The ORM matrices for every sampled timestep are calculated from $\mathcal{F}$. c) All ORM matrices are aggregated to obtain the overall function importance across timesteps. d) A linear programming problem (LPP) is constructed from the importance factor $\theta$ to derive the bit configuration.
  • Figure 3: Orthogonality matrices across timesteps for LDM-4 on ImageNet (steps = 20, eta = 0.0, scale = 3.0).
  • Figure 4: Activation ranges of $x_t$ across all 100 timesteps of the LDM-4 model on ImageNet 256 × 256. The blue regions represent the interquartile range (first to third quartile) of activation values, while the gray regions extend from the 5th to the 95th percentile.
  • Figure 5: $\theta_i$ across 200 generation timesteps for LDM-4 on ImageNet 256 × 256 (steps = 200, eta = 0.0, scale = 3.0).
  • ...and 11 more figures