SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models
Jiaji Zhang, Ruichao Sun, Hailiang Zhao, Jiaju Wu, Peng Chen, Hao Li, Yuying Liu, Kingsum Chow, Gang Xiong, Shuiguang Deng
TL;DR
SegQuant addresses the deployment cost of diffusion models with a semantics-aware, graph-guided post-training quantization (PTQ) framework that avoids architecture-specific heuristics and fits modern compiler pipelines. Its SegLinear component performs segment-wise quantization based on static computation-graph patterns, and its DualScale component preserves activation polarity by using separate scales for the negative and non-negative parts; both keep standard GEMM and CUDA epilogue flows. The approach improves FID, LPIPS, PSNR, and CLIP-based scores on DiT-based backbones, generalizes to other architectures, and adds minimal hardware overhead. This makes quantization of diffusion models practical in real-world deployment pipelines, reducing compute and memory demands without retraining or training data.
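To make the idea of segment-wise quantization concrete, here is a minimal NumPy sketch. It is not the authors' SegLinear implementation: the helper names (`fake_quant`, `fake_quant_segments`, `segmented_linear`), the hand-written segment boundaries, and the synthetic input are all assumptions for illustration. In particular, the boundaries are given by hand here rather than derived from computation-graph patterns, and the weights are left in floating point to isolate the effect on activations.

```python
import numpy as np

def fake_quant(x, n_bits=8):
    """Symmetric quantize-dequantize of a tensor with one shared scale."""
    qmax = 2 ** (n_bits - 1) - 1
    # Guard against an all-zero tensor so the scale is never zero.
    scale = max(float(np.max(np.abs(x))), 1e-12) / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def fake_quant_segments(x, boundaries, n_bits=8):
    """Quantize each feature segment x[:, lo:hi] with its own scale."""
    parts = [fake_quant(x[:, lo:hi], n_bits)
             for lo, hi in zip(boundaries[:-1], boundaries[1:])]
    return np.concatenate(parts, axis=1)

def segmented_linear(x, w, boundaries, n_bits=8):
    """Linear layer as a sum of per-segment GEMMs (weights kept in float)."""
    out = np.zeros((x.shape[0], w.shape[1]))
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        out += fake_quant(x[:, lo:hi], n_bits) @ w[lo:hi, :]
    return out

# Hypothetical input: two concatenated segments with very different ranges.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.05, (64, 32)),
                    rng.normal(0.0, 5.0, (64, 32))], axis=1)
w = rng.normal(0.0, 1.0, (64, 16))
boundaries = [0, 32, 64]  # assumed segment split, chosen by hand

err_whole = np.mean((fake_quant(x) - x) ** 2)
err_seg = np.mean((fake_quant_segments(x, boundaries) - x) ** 2)
y = segmented_linear(x, w, boundaries)
print("activation MSE, one scale:", err_whole)
print("activation MSE, per segment:", err_seg)
```

With a single scale, the wide-range segment sets the step size and most of the narrow-range segment rounds to zero; giving each segment its own scale avoids that, while every partial product remains an ordinary matrix multiply.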
Abstract
Diffusion models have demonstrated exceptional generative capabilities but are computationally intensive, posing significant challenges for deployment in resource-constrained or latency-sensitive environments. Quantization offers an effective means to reduce model size and computational cost, with post-training quantization (PTQ) being particularly appealing due to its compatibility with pre-trained models without requiring retraining or training data. However, existing PTQ methods for diffusion models often rely on architecture-specific heuristics that limit their generalizability and hinder integration with industrial deployment pipelines. To address these limitations, we propose SegQuant, a unified quantization framework that adaptively combines complementary techniques to enhance cross-model versatility. SegQuant consists of a segment-aware, graph-based quantization strategy (SegLinear) that captures structural semantics and spatial heterogeneity, along with a dual-scale quantization scheme (DualScale) that preserves polarity-asymmetric activations, which is crucial for maintaining visual fidelity in generated outputs. SegQuant is broadly applicable beyond Transformer-based diffusion models, achieving strong performance while ensuring seamless compatibility with mainstream deployment tools.
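The dual-scale idea can be illustrated with a small NumPy sketch. This is an assumption-laden illustration rather than the paper's DualScale kernel: the function names are hypothetical, the 4-bit setting is chosen only to make the effect visible, and a SiLU output is used as a stand-in for a polarity-asymmetric activation (its negative values never go below about -0.28, while its positive values are unbounded).

```python
import numpy as np

def fake_quant_single(x, n_bits=4):
    """Symmetric quantize-dequantize with one scale for the whole tensor."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(float(np.max(np.abs(x))), 1e-12) / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def fake_quant_dual(x, n_bits=4):
    """Quantize negative and non-negative parts with separate scales."""
    qmax = 2 ** (n_bits - 1) - 1
    neg = np.minimum(x, 0.0)  # negative part, zero elsewhere
    pos = np.maximum(x, 0.0)  # non-negative part, zero elsewhere
    s_neg = max(float(np.max(-neg)), 1e-12) / qmax
    s_pos = max(float(np.max(pos)), 1e-12) / qmax
    q_neg = np.clip(np.round(neg / s_neg), -qmax, 0)
    q_pos = np.clip(np.round(pos / s_pos), 0, qmax)
    # Each part is dequantized with its own scale, then recombined.
    return q_neg * s_neg + q_pos * s_pos

# Assumed example activation: SiLU(t) = t * sigmoid(t) on a fixed grid.
t = np.linspace(-6.0, 6.0, 1001)
x = t / (1.0 + np.exp(-t))

single = fake_quant_single(x)
dual = fake_quant_dual(x)
mse_single = np.mean((single - x) ** 2)
mse_dual = np.mean((dual - x) ** 2)
print("MSE, single scale:", mse_single)
print("MSE, dual scale:", mse_dual)
```

In this example the single scale is dictated by the large positive range, so every negative value collapses to zero and the sign information is lost. With separate scales the negative part keeps its own resolution, and the positive part is quantized exactly as before.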
