DGQ: Distribution-Aware Group Quantization for Text-to-Image Diffusion Models

Hyogon Ryu, NaHyeon Park, Hyunjung Shim

TL;DR

DGQ (Distribution-aware Group Quantization) quantizes text-to-image diffusion models to low bit-widths without fine-tuning weight quantization parameters. It preserves activation outliers through a dimension-selective grouping strategy and applies prompt-specific logarithmic quantization to cross-attention scores, with dedicated start-token handling and dynamic scaling. On MS-COCO and PartiPrompts it maintains image fidelity and text-image alignment while reducing compute through fewer bit operations. The reported FID improves on full precision (e.g., 13.15 vs 14.44), CLIP scores change only marginally, and the method remains robust at 6-bit activations, where baselines fail. As a post-training quantization (PTQ) method, it makes diffusion models more practical to deploy, including on edge devices.

Abstract

Despite the widespread use of text-to-image diffusion models across various tasks, their computational and memory demands limit practical applications. To mitigate this issue, quantization of diffusion models has been explored. It reduces memory usage and computational costs by compressing weights and activations into lower-bit formats. However, existing methods often struggle to preserve both image quality and text-image alignment, particularly in lower-bit ($<$ 8 bits) quantization. In this paper, we analyze the challenges associated with quantizing text-to-image diffusion models from a distributional perspective. Our analysis reveals that activation outliers play a crucial role in determining image quality. Additionally, we identify distinctive patterns in cross-attention scores, which significantly affect text-image alignment. To address these challenges, we propose Distribution-aware Group Quantization (DGQ), a method that identifies and adaptively handles pixel-wise and channel-wise outliers to preserve image quality. Furthermore, DGQ applies prompt-specific logarithmic quantization scales to maintain text-image alignment. Our method demonstrates remarkable performance on datasets such as MS-COCO and PartiPrompts. We are the first to successfully achieve low-bit quantization of text-to-image diffusion models without requiring additional fine-tuning of weight quantization parameters. Code is available at https://github.com/ugonfor/DGQ.
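To make the role of outliers concrete, the following is a minimal NumPy sketch, not the authors' implementation, comparing a single layer-wise min-max scale against separate scales for an outlier group and the remaining channels. The function names `minmax_quantize` and `groupwise_quantize`, the synthetic activation matrix, and the hand-picked outlier channel are all assumptions made for illustration; DGQ identifies its groups adaptively, which this sketch does not attempt.

```python
import numpy as np

def minmax_quantize(x, n_bits=8):
    """Uniform min-max quantization followed by dequantization (one scale for all of x)."""
    levels = 2 ** n_bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo

def groupwise_quantize(x, outlier_channels, n_bits=8):
    """Quantize the outlier channels and the remaining channels with separate scales.

    x has shape (pixels, channels); outlier_channels is a hypothetical,
    hand-supplied list of channel indices.
    """
    mask = np.zeros(x.shape[1], dtype=bool)
    mask[outlier_channels] = True
    out = np.empty_like(x)
    out[:, mask] = minmax_quantize(x[:, mask], n_bits)
    out[:, ~mask] = minmax_quantize(x[:, ~mask], n_bits)
    return out

# Synthetic activations: 64 pixels, 16 channels, one channel with large magnitude.
rng = np.random.default_rng(0)
act = rng.normal(0.0, 1.0, size=(64, 16))
act[:, 3] *= 50.0  # assumed channel-wise outlier

layer_err = np.mean((act - minmax_quantize(act, n_bits=6)) ** 2)
group_err = np.mean((act - groupwise_quantize(act, [3], n_bits=6)) ** 2)
print(f"layer-wise MSE: {layer_err:.4f}  group-wise MSE: {group_err:.4f}")
```

With a single scale, the outlier channel stretches the quantization range so that the ordinary channels share very few levels. Giving the outlier group its own scale keeps the outliers representable without sacrificing resolution elsewhere.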

Paper Structure (22 sections, 8 equations, 14 figures, 6 tables)


Figures (14)

  • Figure 1: Memory requirements and computational cost of Stable Diffusion v1.4.
  • Figure 2: The impact of DGQ. (a) Two types of performance degradation in text-to-image diffusion model quantization. DGQ preserves both text-image alignment (as shown above) and image quality (as shown below) significantly better than TFMQ-DM. Each model is quantized to the 8-bit setting (both weights and activations). (b) Performance comparison with other methods.
  • Figure 3: Comparison of quantization strategies. We show layer-wise, channel-wise and group-wise quantization methods. Minmax and MSE (mean-squared error) are the most common strategies for calibrating the quantization scale, but both approaches struggle to effectively quantize the activations. The gray dotted lines represent the quantized values. Unlike layer-wise quantization, in channel-wise quantization, the quantized values are adapted to each channel. In group-wise quantization, the quantized values are adapted to groups, such as outliers or other channels. More detailed information about quantization granularity can be found in the appendix.
  • Figure 4: Characteristics of activation outliers. (a) Comparison of dropping random values and dropping outlier values. (b) Two types of outliers are identified. These outliers often appear in specific channels or at specific pixels. A full visualization of the activation matrix is provided in the appendix.
  • Figure 5: Characteristics of cross-attention scores. (a) The <start> token causes a peak near $1.0$ (left). Background pixels tend to have high attention scores for the <start> token (right). (b) Unlike self-attention, the maximum values of cross-attention scores change more dynamically.
  • ...and 9 more figures
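
The cross-attention behaviour described for Figure 5 can be illustrated with a small NumPy sketch. It is an assumption-laden toy, not the paper's method: the helper names `log2_quantize` and `uniform_quantize`, the example attention row, the choice to set the <start> token aside before quantizing, and the use of the row maximum as a dynamic scale are all hypothetical choices made to show why a logarithmic grid suits scores that cluster near zero.

```python
import numpy as np

def log2_quantize(p, n_bits=4, scale=1.0):
    """Power-of-two quantization: p is approximated by scale * 2**(-q), q an integer code."""
    levels = 2 ** n_bits - 1
    safe = np.maximum(p, np.finfo(p.dtype).tiny)  # guard against log2(0)
    q = np.clip(np.round(-np.log2(safe / scale)), 0, levels)
    return scale * 2.0 ** (-q)

def uniform_quantize(p, n_bits=4):
    """Uniform quantization of values in [0, 1]."""
    levels = 2 ** n_bits - 1
    return np.round(p * levels) / levels

# Hypothetical attention row for one pixel: the first entry is the <start> token.
row = np.array([0.90, 0.04, 0.03, 0.02, 0.01])
rest = row[1:]  # set the <start> token aside (assumed handling)

# Dynamic scale: the largest remaining score for this particular row.
log_hat = log2_quantize(rest, n_bits=4, scale=float(rest.max()))
uni_hat = uniform_quantize(rest, n_bits=4)

log_err = np.mean(np.abs(log_hat - rest))
uni_err = np.mean(np.abs(uni_hat - rest))
print("log2 grid:   ", log_hat, "mean abs error:", log_err)
print("uniform grid:", uni_hat, "mean abs error:", uni_err)
```

In this toy row the uniform 4-bit grid rounds most of the small text-token scores to zero, while the logarithmic grid with a per-row scale keeps them distinct. This is the intuition behind pairing a logarithmic quantizer with a scale that follows the changing maximum.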