Modulated Diffusion: Accelerating Generative Modeling with Modulated Quantization

Weizhi Gao, Zhichao Hou, Junqi Yin, Feiyi Wang, Linyu Peng, Xiaorui Liu

TL;DR

Diffusion models incur high computational cost due to iterative sampling. The authors propose MoDiff, a training-free framework that uses modulated quantization and error compensation to exploit temporal redundancy in diffusion steps, enabling aggressive activation quantization (down to 3 bits) without sacrificing fidelity. Theoretical results bound quantization error and demonstrate exponential suppression of accumulation with error compensation, while extensive experiments on CIFAR-10, LSUN, and larger models show substantial compute savings (often >10x GBops) and preserved or improved generation quality across multiple samplers. MoDiff is solver-agnostic and quantization-method-agnostic, making it a versatile augmentation to PTQ and caching approaches with broad practical impact for efficient diffusion-based generation.

Abstract

Diffusion models have emerged as powerful generative models, but their high computation cost in iterative sampling remains a significant bottleneck. In this work, we present an in-depth and insightful study of state-of-the-art acceleration techniques for diffusion models, including caching and quantization, revealing their limitations in computation error and generation quality. To break these limits, this work introduces Modulated Diffusion (MoDiff), an innovative, rigorous, and principled framework that accelerates generative modeling through modulated quantization and error compensation. MoDiff not only inherits the advantages of existing caching and quantization methods but also serves as a general framework to accelerate all diffusion models. The advantages of MoDiff are supported by solid theoretical insight and analysis. In addition, extensive experiments on CIFAR-10 and LSUN demonstrate that MoDiff significantly reduces activation quantization from 8 bits to 3 bits without performance degradation in post-training quantization (PTQ). Our code implementation is available at https://github.com/WeizhiGao/MoDiff.

Paper Structure

This paper contains 34 sections, 5 theorems, 37 equations, 10 figures, 21 tables.

Key Result

Theorem 4.3

Let $\mathbf{x} \in \mathbb{R}^d$ be a vector, and let the quantization bandwidth be $b \in \mathbb{N}$. Define the max-min dynamic quantizer as follows: The corresponding dequantization is given by: The quantization error is bounded in terms of the quantization scaling factor $s$, which depends on the range of $\mathbf{x}$ and the bandwidth $b$. Specifically, we have:
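The equations of the theorem are not reproduced in this summary, so the following is only a minimal sketch of a textbook max-min (min-max) dynamic quantizer, assuming uniform levels and round-to-nearest. The function names `minmax_quantize` and `minmax_dequantize` are hypothetical, and the paper's exact definition and constants may differ. Under these assumptions the per-element reconstruction error is at most half the scaling factor $s$, which is the kind of bound the theorem describes.

```python
import numpy as np

def minmax_quantize(x, bits):
    """Map x to integer codes in [0, 2**bits - 1] using its own min and max."""
    lo = float(x.min())
    hi = float(x.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels  # scaling factor s: grows with the range, shrinks with bits
    if scale == 0.0:
        # Constant input: every element maps to code 0 and is recovered exactly.
        return np.zeros(x.shape, dtype=np.int64), scale, lo
    codes = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.int64)
    return codes, scale, lo

def minmax_dequantize(codes, scale, lo):
    """Map integer codes back to real values."""
    return codes * scale + lo

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
for bits in (8, 4, 3):
    codes, s, lo = minmax_quantize(x, bits)
    x_hat = minmax_dequantize(codes, s, lo)
    max_err = np.abs(x - x_hat).max()
    # Round-to-nearest keeps every element within half a step of its original value.
    print(f"bits={bits} s={s:.4f} max_err={max_err:.4f} s/2={s / 2:.4f}")
```

Because $s$ is proportional to the range of the input, a tensor with a narrower range is quantized more accurately at the same number of bits.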

Figures (10)

  • Figure 1: A preliminary study using DDIM on CIFAR-10 with 100 generation steps. (a) The relative $\ell_2$ distance between the cached and standard diffusion features in the middle block, initialized from the same noise. As the reuse frequency increases, error accumulation becomes more significant. (b) The distribution of activations and their temporal differences across different diffusion time steps. The blue violin plots show that activation ranges fluctuate over time and exhibit outliers with long-tailed distributions. In contrast, the orange violin plots demonstrate more consistent ranges and concentrated distributions.
  • Figure 2: (a) Standard PTQ methods: The computations at different time steps are independent, with the raw activation ${\mathbf{a}}_t^{(l)}$ serving directly as the input to the quantizer. (b) Quantization with our MoDiff: For each linear operator, such as linear layers and convolutional layers, we cache the output from the previous time step, $\hat{{\mathbf{a}}}_{t}^{(l)}$, and input the temporal difference ${\mathbf{a}}_{t-1}^{(l)}-\hat{{\mathbf{a}}}_{t}^{(l)}$ into the quantizer. The final output is obtained by aggregating the current computation results of ${\mathcal{A}}^{l}$ with the cached output from the previous step $\hat{{\mathbf{o}}}_{t}^{(l)}$.
  • Figure 3: The relative $\ell_2$ distance between the features in the standard diffusion model and the quantized model in the middle block. "w/ EC" denotes the use of the error-compensation technique.
  • Figure 4: Visualization of MS-COCO-2014 generated using LTQ and LTQ+MoDiff under 8-bit weight quantization precision on Stable Diffusion v1.4.
  • Figure 5: Visualization of LSUN-Churches $256\times256$ generated using LCQ and LCQ+MoDiff under 8-bit weight quantization precision.
  • ...and 5 more figures
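
The scheme described in the Figure 2(b) caption can be sketched for a single linear layer. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the class name `ModulatedLinear`, the helper `fake_quant`, the full-precision first step that fills the cache, and the use of simulated (floating-point) quantization are all hypothetical choices made here. The layer quantizes the difference between the current activation and the cached reconstruction of the previous one, then adds the result of that low-bit computation to the cached output.

```python
import numpy as np

def fake_quant(x, bits):
    """Simulated min-max quantization: quantize, then immediately dequantize."""
    lo, hi = float(x.min()), float(x.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    if scale == 0.0:
        return x.copy()
    codes = np.clip(np.round((x - lo) / scale), 0, levels)
    return codes * scale + lo

class ModulatedLinear:
    """Bias-free linear layer that quantizes temporal differences (hypothetical sketch)."""

    def __init__(self, weight, bits):
        self.weight = np.asarray(weight, dtype=float)
        self.bits = bits
        self.a_hat = None  # cached reconstruction of the previous input
        self.o_hat = None  # cached output from the previous step

    def __call__(self, a):
        a = np.asarray(a, dtype=float)
        if self.a_hat is None:
            # Assumption: the first step runs in full precision to fill the cache.
            self.a_hat = a.copy()
            self.o_hat = a @ self.weight.T
            return self.o_hat
        # The difference is taken against the cached reconstruction, so the
        # quantization error left over from earlier steps is part of what gets
        # quantized now (error compensation).
        delta_q = fake_quant(a - self.a_hat, self.bits)
        self.a_hat = self.a_hat + delta_q
        # Linearity: W(a_hat + delta_q) = W a_hat + W delta_q, so only the
        # low-bit difference goes through the matrix product.
        self.o_hat = self.o_hat + delta_q @ self.weight.T
        return self.o_hat

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
layer = ModulatedLinear(W, bits=3)
base, drift = rng.normal(size=16), rng.normal(size=16)
for step in range(5):
    a = base + 0.01 * step * drift  # slowly varying activations
    out = layer(a)
    direct = fake_quant(a, 3) @ W.T
    exact = a @ W.T
    print(step, np.linalg.norm(out - exact), np.linalg.norm(direct - exact))
```

The design choice worth noting is that the input to the quantizer is the difference from the cached reconstruction rather than from the true previous activation. In this sketch that keeps the reconstruction error at each step bounded by the quantization error of a single difference instead of letting it add up across steps.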

Theorems & Definitions (8)

  • Remark 4.1
  • Remark 4.2
  • Theorem 4.3: Quantization Error
  • Theorem 4.4
  • Remark 5.1
  • Theorem 1.1: Restated, Theorem 4.3
  • Theorem 1.2: Restated, Theorem 4.4
  • Corollary 1.3