SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models

Jiaji Zhang, Ruichao Sun, Hailiang Zhao, Jiaju Wu, Peng Chen, Hao Li, Yuying Liu, Kingsum Chow, Gang Xiong, Shuiguang Deng

TL;DR

SegQuant addresses the cost of deploying diffusion models with a semantics-aware, graph-guided post-training quantization (PTQ) framework that avoids architecture-specific heuristics and fits modern compiler pipelines. Its SegLinear component quantizes linear layers segment by segment, guided by static computation-graph patterns, and its DualScale component preserves activation polarity by using separate scales for the negative and non-negative parts; both keep the standard GEMM and CUDA epilogue flow. The approach improves image fidelity and perceptual metrics (FID, LPIPS, PSNR, and CLIP-based scores) on DiT-based backbones, generalizes to other architectures, and adds minimal hardware overhead. This makes quantizing diffusion models practical in real-world deployment pipelines, reducing compute and memory demands without retraining or access to training data.

Abstract

Diffusion models have demonstrated exceptional generative capabilities but are computationally intensive, posing significant challenges for deployment in resource-constrained or latency-sensitive environments. Quantization offers an effective means to reduce model size and computational cost, with post-training quantization (PTQ) being particularly appealing due to its compatibility with pre-trained models without requiring retraining or training data. However, existing PTQ methods for diffusion models often rely on architecture-specific heuristics that limit their generalizability and hinder integration with industrial deployment pipelines. To address these limitations, we propose SegQuant, a unified quantization framework that adaptively combines complementary techniques to enhance cross-model versatility. SegQuant consists of a segment-aware, graph-based quantization strategy (SegLinear) that captures structural semantics and spatial heterogeneity, along with a dual-scale quantization scheme (DualScale) that preserves polarity-asymmetric activations, which is crucial for maintaining visual fidelity in generated outputs. SegQuant is broadly applicable beyond Transformer-based diffusion models, achieving strong performance while ensuring seamless compatibility with mainstream deployment tools.
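
The dual-scale idea can be illustrated with a small sketch. It is not the authors' implementation: it assumes symmetric 8-bit fake quantization with absmax scales, splits the activations into a negative and a non-negative part, and quantizes each part with its own scale. All function names (`absmax_scale`, `fake_quant`, `single_scale`, `dual_scale`) and the SiLU test input are hypothetical choices made for illustration.

```python
import math

QMAX = 127  # largest magnitude on a signed 8-bit symmetric grid (assumed bit width)


def absmax_scale(values):
    """Scale mapping the largest magnitude to QMAX (1.0 if all values are zero)."""
    peak = max(abs(v) for v in values)
    return peak / QMAX if peak > 0 else 1.0


def fake_quant(values, scale):
    """Round to the integer grid, clip to [-QMAX, QMAX], and map back to floats."""
    return [max(-QMAX, min(QMAX, round(v / scale))) * scale for v in values]


def single_scale(x):
    """Baseline: one scale shared by negative and non-negative values."""
    return fake_quant(x, absmax_scale(x))


def dual_scale(x):
    """Illustrative dual-scale scheme: separate scales for each polarity."""
    neg = [min(v, 0.0) for v in x]  # negative part, zeros elsewhere
    pos = [max(v, 0.0) for v in x]  # non-negative part, zeros elsewhere
    deq_neg = fake_quant(neg, absmax_scale(neg))
    deq_pos = fake_quant(pos, absmax_scale(pos))
    return [n + p for n, p in zip(deq_neg, deq_pos)]


def mean_abs_error(reference, approx):
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)


def silu(t):
    """SiLU activation: small negative range, large positive range."""
    return t / (1.0 + math.exp(-t))


if __name__ == "__main__":
    x = [silu(-6.0 + 0.1 * i) for i in range(121)]
    print("single-scale error:", mean_abs_error(x, single_scale(x)))
    print("dual-scale error:  ", mean_abs_error(x, dual_scale(x)))
```

On a polarity-asymmetric input such as SiLU outputs, the shared scale is set by the large positive range, so the narrow negative range falls on only a few grid points. Giving the negative part its own scale restores its resolution while leaving the positive part unchanged.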

Paper Structure

This paper contains 33 sections, 6 equations, 48 figures, 11 tables, 1 algorithm.

Figures (48)

  • Figure 1: SegQuant framework follows a top-down workflow that effectively integrates existing quantization techniques with our novel contributions.
  • Figure 2: Structural overview of the DiT diffusion model, highlighting latent-related modules (left) and time-related modules (right).
  • Figure 3: Frobenius norm of error $\|\boldsymbol{\Delta \epsilon_t}\|_F$ over timesteps for INTW8A8 vs. FP16 across linear layers.
  • Figure 4: Visualization of weights in AdaNorm within the TimeEmbedding module. The distribution reveals distinct semantic patterns.
  • Figure 5: SegLinear reveals two semantic patterns in the weight matrix that guide quantization.
  • ...and 43 more figures
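
The captions of Figures 3 and 5 mention a Frobenius-norm error measure and segment-level patterns in a weight matrix. The sketch below shows, under stated assumptions, why per-segment scales can help when segments differ in magnitude. It is not SegLinear itself: the segments are assumed to be contiguous row blocks supplied by hand, the weights are synthetic, and the Frobenius norm is applied to the weight error rather than to the noise-prediction error of Figure 3. All names are hypothetical.

```python
import math
import random

QMAX = 127  # signed 8-bit symmetric grid (assumed bit width)


def absmax_scale(rows):
    """Scale mapping the largest magnitude in a block of rows to QMAX."""
    peak = max(abs(w) for row in rows for w in row)
    return peak / QMAX if peak > 0 else 1.0


def fake_quant(rows, scale):
    """Quantize and dequantize every entry of a block with one scale."""
    return [[max(-QMAX, min(QMAX, round(w / scale))) * scale for w in row]
            for row in rows]


def frobenius_error(a, b):
    """Frobenius norm of the difference between two equally shaped matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for ra, rb in zip(a, b)
                         for x, y in zip(ra, rb)))


def quantize_per_tensor(weight):
    """Baseline: a single scale for the whole matrix."""
    return fake_quant(weight, absmax_scale(weight))


def quantize_segmented(weight, boundaries):
    """One scale per contiguous row segment; boundaries like [0, 8, 16]."""
    out = []
    for start, stop in zip(boundaries[:-1], boundaries[1:]):
        segment = weight[start:stop]
        out.extend(fake_quant(segment, absmax_scale(segment)))
    return out


def make_weight(rng, rows_per_segment=8, cols=16):
    """Synthetic matrix with a small-magnitude and a large-magnitude segment."""
    small = [[rng.gauss(0.0, 0.01) for _ in range(cols)]
             for _ in range(rows_per_segment)]
    large = [[rng.gauss(0.0, 1.0) for _ in range(cols)]
             for _ in range(rows_per_segment)]
    return small + large


if __name__ == "__main__":
    w = make_weight(random.Random(0))
    print("per-tensor error:", frobenius_error(w, quantize_per_tensor(w)))
    print("segmented error: ", frobenius_error(w, quantize_segmented(w, [0, 8, 16])))
```

With a single scale, the large-magnitude segment sets the step size and most of the small-magnitude segment rounds to zero. Per-segment scales remove that coupling. In SegQuant the segments are described as coming from static computation-graph patterns rather than from hand-written boundaries.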