MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization

Tianchen Zhao; Xuefei Ning; Tongcheng Fang; Enshu Liu; Guyue Huang; Zinan Lin; Shengen Yan; Guohao Dai; Yu Wang

TL;DR

This work tackles the memory and latency barriers of few-step text-to-image diffusion by proposing MixDQ, a memory-efficient mixed-precision quantization framework. It combines BOS-aware quantization for text embeddings, a metric-decoupled sensitivity analysis that treats image content and image quality separately, and an integer-programming-based bit-width allocator that searches the $b \in \{2,4,8\}$ budget space. Across extensive COCO-based evaluations and hardware profiling, MixDQ achieves $W8A8$ quantization with negligible performance loss and substantial memory (≈3×) and latency (≈1.5×) improvements compared with FP16, while maintaining or approaching the FP16 baseline's image-text alignment. These contributions enable practical deployment of few-step diffusion models on constrained devices and offer a scalable blueprint for quantization in other generative architectures.

Abstract

Diffusion models have achieved remarkable visual generation quality. However, their significant computational and memory costs pose challenges for deployment on resource-constrained mobile devices or even desktop GPUs. Recent few-step diffusion models reduce inference time by reducing the number of denoising steps, but their memory consumption remains excessive. Post-Training Quantization (PTQ) replaces high bit-width floating-point (FP) representations with low-bit integer values (INT4/8), which is an effective and efficient technique for reducing memory cost. However, when applied to few-step diffusion models, existing quantization methods struggle to preserve both image quality and text alignment. To address this issue, we propose a mixed-precision quantization framework, MixDQ. First, we design a specialized BOS-aware quantization method for the highly sensitive text embeddings. Then, we conduct a metric-decoupled sensitivity analysis to measure the sensitivity of each layer. Finally, we develop an integer-programming-based method for bit-width allocation. While existing quantization methods fall short at W8A8, MixDQ achieves W8A8 without performance loss and W4A8 with negligible visual degradation. Compared with FP16, we achieve a 3-4x reduction in model size and memory cost and a 1.45x latency speedup.
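To make the bit-width allocation step concrete, the following is a minimal sketch of a knapsack-style formulation: pick one bit-width from {2, 4, 8} per layer so that total sensitivity is minimized under a weight-memory budget. The objective, the constraint, the function name `allocate_bits`, and all sensitivity numbers are illustrative assumptions, not the formulation or data from the paper, and exhaustive search stands in for an integer-programming solver.

```python
from itertools import product


def allocate_bits(param_counts, sensitivity, budget_bits, choices=(2, 4, 8)):
    """Choose one bit-width per layer, minimizing summed sensitivity
    subject to a total weight-storage budget (in bits).

    param_counts: number of weights in each layer.
    sensitivity:  per-layer dict mapping bit-width -> hypothetical cost.
    Exhaustive search: only practical for a handful of layers.
    """
    best_cost, best_assign = float("inf"), None
    for assign in product(choices, repeat=len(param_counts)):
        size = sum(n * b for n, b in zip(param_counts, assign))
        if size > budget_bits:
            continue  # violates the memory budget
        cost = sum(s[b] for s, b in zip(sensitivity, assign))
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_assign, best_cost


# Toy model with three layers (all numbers are made up for illustration).
param_counts = [1000, 4000, 2000]
sensitivity = [
    {2: 9.0, 4: 3.0, 8: 0.1},   # small but highly sensitive layer
    {2: 1.0, 4: 0.2, 8: 0.05},  # large, insensitive layer
    {2: 4.0, 4: 0.5, 8: 0.1},
]
budget_bits = 4 * sum(param_counts)  # average of 4 bits per weight

assignment, cost = allocate_bits(param_counts, sensitivity, budget_bits)
print(assignment, cost)
```

In this toy setting the search keeps the small, sensitive layer at 8 bits and pushes the large, insensitive layer down to 2 bits, which is the kind of trade-off a mixed-precision allocator is meant to find.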

Paper Structure (24 sections, 2 equations, 16 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Methods
BOS-aware Text Embedding Quantization
Metric-Decoupled Sensitivity Analysis
Integer Programming Bit-width Allocation
Experiments
Experimental Settings
Performance and Efficiency Comparison
Hardware Resource Savings
Analysis
Analysis of Pareto Frontier
Analysis of BOS-aware Quantization
Ablation Studies
Analysis of Quantization Method
...and 9 more sections

Figures (16)

Figure 1: The effectiveness of MixDQ. Left: MixDQ preserves both image quality and image-text alignment. Right: The efficiency improvements of MixDQ.
Figure 2: Insightful findings for few-step text-to-image diffusion quantization. Left: The layer-sensitivity distribution has a "long-tail" characteristic. Right: Quantization affects both image quality and content.
Figure 3: Framework of the proposed mixed-precision quantization method, MixDQ. It consists of three key components: BOS-aware quantization addresses the highly sensitive text embedding, the metric-decoupled scheme improves sensitivity analysis, and integer programming finds the optimal bit-width allocation.
Figure 4: Illustration of BOS-aware Quantization. Left: the first token has a significantly larger value than the others. Right: Since BOS token features remain the same for different prompts, we skip quantizing them and pre-compute them offline.
Figure 5: Decoupling metrics and layers to separate the influence on image quality and content. Left: Existing SQNR-based sensitivity analysis needs improving. Right: SQNR's problem of overemphasizing content change.
...and 11 more figures
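
The idea in the Figure 4 caption (the first token is an outlier, so it is left unquantized and precomputed offline) can be illustrated with a small sketch. It assumes a symmetric per-tensor quantizer with a max-abs scale and uses made-up feature values; the quantizer actually used in the paper is not specified in this section, and the helper names are hypothetical.

```python
def quantize_dequantize(values, n_bits=8):
    """Symmetric per-tensor fake quantization using a max-abs scale."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return list(values)
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(reference, approx):
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)


# Toy text-embedding features, one row per token (values are illustrative).
# The first row plays the role of the BOS token with much larger magnitudes.
tokens = [
    [40.0, -35.0, 38.0],  # BOS token (outlier)
    [0.5, -0.3, 0.2],
    [0.1, 0.4, -0.6],
]
bos, rest = tokens[0], [v for row in tokens[1:] for v in row]

# Naive: one scale shared by all tokens, so the BOS outlier sets the range.
naive = quantize_dequantize(bos + rest)
naive_err = mean_abs_error(rest, naive[len(bos):])

# BOS-aware: keep the BOS features in full precision (they are the same for
# every prompt, so they can be computed once offline) and quantize the rest.
aware = quantize_dequantize(rest)
aware_err = mean_abs_error(rest, aware)

print(f"naive error on non-BOS tokens:     {naive_err:.4f}")
print(f"BOS-aware error on non-BOS tokens: {aware_err:.4f}")
```

With the outlier excluded, the quantization range shrinks to fit the remaining tokens, so their rounding error drops sharply in this toy example.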
