Efficiency Meets Fidelity: A Novel Quantization Framework for Stable Diffusion

Shuaiting Li, Juncan Deng, Zeyu Wang, Kedong Xu, Rongtao Deng, Hong Gu, Haibin Shen, Kejie Huang

TL;DR

This framework introduces a Serial-to-Parallel pipeline that simultaneously maintains training-inference consistency and ensures optimization stability, and develops several techniques including multi-timestep activation quantization, time information precalculation, inter-layer distillation, and selective freezing to achieve high-fidelity generation in comparison to floating-point models while maintaining quantization efficiency.

Abstract

Text-to-image generation via Stable Diffusion models (SDM) has demonstrated remarkable capabilities. However, their computational intensity, particularly in the iterative denoising process, hinders real-time deployment in latency-sensitive applications. Recent studies have explored post-training quantization (PTQ) and quantization-aware training (QAT) methods to compress diffusion models, but existing methods often overlook the consistency between results generated by quantized models and those from floating-point models. This consistency is paramount for professional applications where both efficiency and output reliability are essential. To ensure that quantized SDM generates high-quality and consistent images, we propose an efficient quantization framework for SDM. Our framework introduces a Serial-to-Parallel pipeline that simultaneously maintains training-inference consistency and ensures optimization stability. Building upon this foundation, we further develop several techniques, including multi-timestep activation quantization, time information precalculation, inter-layer distillation, and selective freezing, to achieve generation that stays faithful to the floating-point models while maintaining quantization efficiency. Through comprehensive evaluation across multiple Stable Diffusion variants (v1-4, v2-1, XL 1.0, and v3), our method demonstrates superior performance over state-of-the-art approaches with shorter training times. Under W4A8 quantization settings, we achieve significant improvements in both distribution similarity and visual fidelity while preserving high image quality.
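
To make the W4A8 setting concrete (4-bit weights, 8-bit activations), the following is a minimal sketch of uniform "fake" quantization, where values are rounded to an integer grid and mapped back to floats. It assumes symmetric per-tensor scaling; the function name `fake_quantize` and the sample values are hypothetical and are not taken from the paper, whose actual quantizers may differ.

```python
def fake_quantize(values, num_bits):
    """Symmetric per-tensor uniform quantization (an assumed scheme).

    Rounds each value to a signed num_bits integer grid, then maps it
    back to a float so the quantization error can be inspected.
    """
    qmax = 2 ** (num_bits - 1) - 1          # 7 for 4 bits, 127 for 8 bits
    qmin = -qmax - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]


# Hypothetical toy tensors, flattened to lists.
weights = [0.31, -0.72, 0.05, 0.90]
activations = [1.20, -0.40, 2.55, 0.00]

w_q = fake_quantize(weights, num_bits=4)      # "W4"
a_q = fake_quantize(activations, num_bits=8)  # "A8"
```

With only 16 levels available, the weight grid is much coarser than the activation grid, which is one reason low-bit weight quantization usually needs training-based recovery rather than simple rounding.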

Paper Structure

This paper contains 17 sections, 10 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Comparison of (A) 'Serial' pipeline which mimics the inference process of SDM and (B) 'Parallel' pipeline which is more aligned with the pretraining process of SDM. $X_{t}^{n}$ denotes the $n^{th}$ noisy input at timestep $t$. $P$ denotes prompts. $\mu$ denotes the predicted noise.
  • Figure 2: Box plot illustrating the gradient variations in the quantized Stable Diffusion v1-4 model during training. (A) represents the serial pipeline, and (B) represents the parallel pipeline.
  • Figure 3: Comparison of loss (left) and weight oscillation (right) between the serial and parallel pipelines. The serial pipeline suffers from severe oscillation.
  • Figure 4: Difference in noisy input range at each timestep with the same initial latent. (a) Adding Gaussian noise based on Eq. \ref{eq1}. (b) Step-by-step denoising during inference.
  • Figure 5: Overview of our quantization framework. (A) Serial dataset generation: During inference of the floating-point model, latents generated at various timesteps for each prompt are randomly sampled. (B) Time information precalculation: The feature maps of the time projection layers are precalculated for training and inference. (C) Parallel training: At each iteration, latents from various timesteps, along with the corresponding prompts, are selected from the dataset. The loss function is calculated between the output and the sensitive layers. Iterative freezing is applied to these sensitive layers for better stability.
  • ...and 4 more figures
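
The time information precalculation step in Figure 5 (B) can be illustrated with a small sketch. The idea shown here is that the output of a time projection layer depends only on the timestep, not on the image latent, so it can be computed once per timestep and looked up afterwards. Everything concrete below is an assumption for illustration: the sinusoidal embedding, the single linear layer standing in for the time projection, the helper names, and the toy weights and timesteps are hypothetical and do not come from the paper.

```python
import math


def sinusoidal_embedding(t, dim):
    """Standard sinusoidal timestep embedding (assumed; dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def time_projection(emb, weight, bias):
    """Hypothetical stand-in for a time projection layer: y = W @ emb + b."""
    return [sum(w * e for w, e in zip(row, emb)) + b
            for row, b in zip(weight, bias)]


def precalculate_time_features(timesteps, dim, weight, bias):
    """Build a lookup table: timestep -> precalculated projection output."""
    return {t: time_projection(sinusoidal_embedding(t, dim), weight, bias)
            for t in timesteps}


# Toy configuration (hypothetical values).
dim = 4
weight = [[0.1 * (r + c) for c in range(dim)] for r in range(2)]
bias = [0.0, 0.5]
inference_timesteps = [981, 741, 501, 261, 21]

table = precalculate_time_features(inference_timesteps, dim, weight, bias)
feature = table[501]  # looked up instead of recomputed at each denoising step
```

Because the table is filled from the floating-point computation, a layer handled this way would not need to be evaluated, or quantized, at run time for the timesteps covered by the table.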