SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, Song Han

TL;DR

As diffusion models scale, memory use and latency become deployment bottlenecks. The authors introduce SVDQuant, a 4-bit post-training quantization framework that absorbs outliers with a 16-bit low-rank branch and quantizes the remaining residual to 4 bits. It is paired with Nunchaku, an inference engine that fuses the low-rank kernels into the low-bit kernels to avoid redundant data movement. The approach preserves image quality on large backbones (e.g., the 12B FLUX.1 models) while reducing memory usage by about 3.5× and running about 3× faster than a 4-bit weight-only (W4A16) baseline, and it supports off-the-shelf LoRA adapters without re-quantization. Together, these make it practical to deploy large diffusion models on memory-constrained hardware such as a 16GB laptop GPU.

Abstract

Diffusion models can effectively generate high-quality images. However, as they scale, rising memory demands and higher latency pose substantial deployment challenges. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive, where existing post-training quantization methods like smoothing become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Different from smoothing, which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first consolidate the outliers by shifting them from activations to weights. Then, we use a high-precision, low-rank branch to take in the weight outliers with Singular Value Decomposition (SVD), while a low-bit quantized branch handles the residuals. This process eases the quantization on both sides. However, naively running the low-rank branch independently incurs significant overhead due to extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine Nunchaku that fuses the kernels of the low-rank branch into those of the low-bit branch to cut off redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without re-quantization. Extensive experiments on SDXL, PixArt-$\Sigma$, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage for the 12B FLUX.1 models by 3.5$\times$, achieving 3.0$\times$ speedup over the 4-bit weight-only quantization (W4A16) baseline on the 16GB laptop 4090 GPU with INT4 precision. On the latest RTX 5090 desktop with Blackwell architecture, we achieve a 3.1$\times$ speedup compared to the W4A16 model using NVFP4 precision.
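
The three steps described in the abstract (migrate outliers from activations to weights, absorb the weight outliers with a low-rank SVD branch, quantize the residual) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the SmoothQuant-style smoothing factor, the symmetric per-token and per-column "fake" quantizer, and the names `quantize_4bit` and `svdquant_linear` are all hypothetical choices made for the example, and no kernel fusion is modeled.

```python
import numpy as np

def quantize_4bit(t, axis):
    """Simulated symmetric 4-bit quantization along `axis` (assumed scheme)."""
    scale = np.abs(t).max(axis=axis, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.clip(np.round(t / scale), -8, 7) * scale

def svdquant_linear(x, w, rank=32, alpha=0.5):
    """Approximate x @ w with a full-precision low-rank branch plus a 4-bit residual branch."""
    # Step 1: smoothing. A per-input-channel factor lam moves activation
    # outliers into the weights (SmoothQuant-style formula, an assumption here).
    lam = np.abs(x).max(axis=0) ** alpha / np.maximum(np.abs(w).max(axis=1), 1e-5) ** (1 - alpha)
    lam = np.maximum(lam, 1e-5)
    x_hat = x / lam              # shape (tokens, in_features)
    w_hat = w * lam[:, None]     # shape (in_features, out_features)

    # Step 2: the top-`rank` singular components of w_hat form the low-rank branch L1 @ L2.
    u, s, vt = np.linalg.svd(w_hat, full_matrices=False)
    l1 = u[:, :rank] * s[:rank]
    l2 = vt[:rank]
    residual = w_hat - l1 @ l2

    # Step 3: low-rank branch in full precision, residual branch in simulated 4-bit.
    low_rank_out = x_hat @ l1 @ l2
    low_bit_out = quantize_4bit(x_hat, axis=1) @ quantize_4bit(residual, axis=0)
    return low_rank_out + low_bit_out

# Synthetic data with a few outlier activation channels (illustrative only).
rng = np.random.default_rng(0)
x = rng.standard_normal((128, 64))
x[:, :4] *= 50.0
w = rng.standard_normal((64, 64))

reference = x @ w
naive = quantize_4bit(x, axis=1) @ quantize_4bit(w, axis=0)
err_naive = np.linalg.norm(reference - naive) / np.linalg.norm(reference)
err_svd = np.linalg.norm(reference - svdquant_linear(x, w)) / np.linalg.norm(reference)
print(f"naive W4A4 relative error:     {err_naive:.4f}")
print(f"low-rank + 4-bit relative error: {err_svd:.4f}")
```

On this synthetic input the low-rank variant gives a smaller relative error than naive 4-bit quantization of both operands. In this sketch the two branches are separate matrix products that each read `x_hat`; that extra pass over the activations is the overhead the abstract says Nunchaku removes by fusing the kernels.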

Paper Structure

This paper contains 25 sections, 4 theorems, 17 equations, 19 figures, 4 tables.

Key Result

Proposition 4.1

The quantization error can be decomposed as follows:
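
The bound itself is not reproduced in this summary. As a hedged sketch of how such a decomposition can be obtained, assume the error is measured as $E({\bm{X}}, {\bm{W}}) = \|{\bm{X}}{\bm{W}} - Q({\bm{X}})Q({\bm{W}})\|_F$ for some quantizer $Q$ (this notation is an assumption for illustration, and the paper's exact statement may differ). Adding and subtracting ${\bm{X}}Q({\bm{W}})$, then applying the triangle inequality and the submultiplicativity of the Frobenius norm, gives:

```latex
\begin{aligned}
\|{\bm{X}}{\bm{W}} - Q({\bm{X}})Q({\bm{W}})\|_F
  &= \|{\bm{X}}({\bm{W}} - Q({\bm{W}})) + ({\bm{X}} - Q({\bm{X}}))\,Q({\bm{W}})\|_F \\
  &\le \|{\bm{X}}\|_F \,\|{\bm{W}} - Q({\bm{W}})\|_F
     + \|{\bm{X}} - Q({\bm{X}})\|_F \,\|Q({\bm{W}})\|_F \\
  &\le \|{\bm{X}}\|_F \,\|{\bm{W}} - Q({\bm{W}})\|_F
     + \|{\bm{X}} - Q({\bm{X}})\|_F \left( \|{\bm{W}}\|_F + \|{\bm{W}} - Q({\bm{W}})\|_F \right).
\end{aligned}
```

The last step uses $\|Q({\bm{W}})\|_F \le \|{\bm{W}}\|_F + \|{\bm{W}} - Q({\bm{W}})\|_F$. Read this way, the error depends on the magnitudes of the activation and the weight and on their individual quantization errors, which is consistent with the strategy of shrinking the activation range by smoothing and shrinking the weight magnitude by subtracting a low-rank component.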

Figures (19)

  • Figure 1: SVDQuant is a post-training quantization technique for 4-bit weights and activations that maintains visual fidelity well. On 12B FLUX.1-dev, it achieves 3.6× memory reduction compared to the BF16 model. By eliminating CPU offloading, it offers 8.7× speedup over the 16-bit model on a 16GB laptop 4090 GPU and is 3× faster than the NF4 W4A16 baseline. On PixArt-$\Sigma$, it demonstrates significantly superior visual quality over other W4A4 or even W4A8 baselines. "E2E" means the end-to-end latency including the text encoder and VAE decoder.
  • Figure 2: Computation vs. parameters for LLMs and diffusion models. LLMs' computation is measured with 512 context and 256 output tokens, and diffusion models' computation is for a single step. Dashed lines show trends.
  • Figure 3: Overview of SVDQuant. (a) Originally, both the activation ${\bm{X}}$ and weight ${\bm{W}}$ contain outliers, making 4-bit quantization challenging. (b) We migrate the outliers from the activation to weight, resulting in the updated activation $\hat{{\bm{X}}}$ and weight $\hat{{\bm{W}}}$. While $\hat{{\bm{X}}}$ becomes easier to quantize, $\hat{{\bm{W}}}$ now becomes more difficult. (c) SVDQuant further decomposes $\hat{{\bm{W}}}$ into a low-rank component ${\bm{L}}_1{\bm{L}}_2$ and a residual $\hat{{\bm{W}}}-{\bm{L}}_1{\bm{L}}_2$ with SVD. Thus, the quantization difficulty is alleviated by the low-rank branch, which runs at 16-bit precision.
  • Figure 4: Example value distribution of inputs and weights in PixArt-$\Sigma$. ${\bm{\lambda}}$ is the smoothing factor. Red indicates the outliers. Initially, both the input ${\bm{X}}$ and weight ${\bm{W}}$ contain significant outliers. After smoothing, the range of $\hat{{\bm{X}}}$ is reduced with far fewer outliers, while $\hat{{\bm{W}}}$ shows more outliers. Once the SVD low-rank branch ${\bm{L}}_1{\bm{L}}_2$ is subtracted, the residual ${\bm{R}}$ has a narrower range and is free from outliers.
  • Figure 5: First 64 singular values of ${\bm{W}}$, $\hat{{\bm{W}}}$, and ${\bm{R}}$. The first 32 singular values of $\hat{{\bm{W}}}$ exhibit a steep drop, while the remaining values decay much more gradually.
  • ...and 14 more figures

Theorems & Definitions (6)

  • Proposition 4.1: Error decomposition
  • Proposition 4.2: Quantization error bound
  • Proposition 4.1
  • proof
  • Proposition 4.2
  • proof