Q&C: When Quantization Meets Cache in Efficient Image Generation

Xin Ding, Xin Li, Haotong Qin, Zhibo Chen

TL;DR

This work addresses the efficiency-vs-accuracy trade-off when combining post-training quantization with cache in Diffusion Transformers (DiTs) for image generation. It identifies two major challenges, namely degraded calibration efficacy due to cache and amplified exposure bias, and proposes Temporal-Aware Parallel Clustering (TAP) and Variance Compensation (VC) to mitigate them. TAP dynamically selects informative calibration samples across time steps, while VC adaptively corrects the sampling variance to reduce bias, achieving a speedup of up to $12.7\times$ with competitive visual fidelity on ImageNet and LSUN benchmarks. The approach offers a practical pathway to deploying high-performing DiTs in resource-constrained settings and invites further exploration across more generative models and quantization/cache configurations.

Abstract

Quantization and cache mechanisms are typically applied individually for efficient Diffusion Transformers (DiTs), each demonstrating notable potential for acceleration. However, the benefit of combining the two mechanisms for efficient generation remains under-explored. Through empirical investigation, we find that combining quantization and cache mechanisms for DiT is not straightforward, and two key challenges lead to catastrophic performance degradation: (i) the sample efficacy of calibration datasets in post-training quantization (PTQ) is significantly diminished by the cache operation; (ii) the combination of the above mechanisms introduces more severe exposure bias within the sampling distribution, resulting in amplified error accumulation in the image generation process. In this work, we take advantage of these two acceleration mechanisms and propose a hybrid acceleration method that tackles the above challenges, aiming to further improve the efficiency of DiTs while maintaining excellent generation capability. Concretely, a temporal-aware parallel clustering (TAP) is designed to dynamically improve the sample selection efficacy for calibration within PTQ at different diffusion steps. A variance compensation (VC) strategy is derived to correct the sampling distribution. It mitigates exposure bias by generating an adaptive correction factor. Extensive experiments show that our method accelerates DiTs by 12.7x while preserving competitive generation capability. The code will be available at https://github.com/xinding-sys/Quant-Cache.

Paper Structure

This paper contains 24 sections, 12 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Efficiency-versus-efficacy trade-off across different settings. Bubble size represents the ratio of relative speed-up to generative quality compared to the DDPM baseline at 250 timesteps. We compare various methods in terms of FID (top) and sFID (bottom) performance across 50, 100, and 250 timesteps. Our method consistently appears in the upper-left region across all settings, achieving maximum acceleration while preserving generative quality.
  • Figure 2: Cosine similarity analysis across time steps in DiT for calibration data. This visualization is based on a 250-step DDIM sampling process. Calibration data were collected both without (top) and with (bottom) cache; samples positioned further to the right represent data closer to the final step $x_0$. The heatmap reveals high similarity in calibration datasets when quantization meets cache, particularly in later diffusion stages. This observation motivates our calibration strategy, highlighting a clear need to reduce redundancy and improve efficacy.
  • Figure 3: Analysis of exposure bias in DiT models. The mean squared errors between predicted samples and ground truth samples are computed at each time step. While the exposure bias remains relatively stable in both the cached and quantized models compared to the 50-timestep DiT, a noticeable increase in exposure bias is observed when quantization meets cache, leading to accumulation during the generation process.
  • Figure 4: Comparison of the density distribution of the variance of 5000 samples from ImageNet across different time steps. The plots illustrate the change in sample distribution variance at various time steps, shown for the cases without (top) and with (bottom) quant-cache. As the diffusion progresses, the variance of the sample distribution starts to deviate towards Gaussian white noise.
  • Figure 5: Image generations with our method on DiT. The image sizes are 256 $\times$ 256, with DiT (DDPM, 250 steps, top) and Ours (50 steps, bottom). For more visualizations, please refer to the supplementary materials.
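
The cosine similarity analysis described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' code: the function names `cosine_similarity` and `similarity_matrix` and the toy feature vectors are hypothetical, and each vector merely stands in for a flattened calibration sample collected at one time step. The sketch only shows how a pairwise similarity matrix (the quantity a heatmap like Figure 2 would plot) could be computed, and how near-duplicate samples show up as entries close to 1.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(features):
    """Pairwise cosine similarity between per-time-step feature vectors."""
    n = len(features)
    return [
        [cosine_similarity(features[i], features[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data (an assumption, not from the paper): one flattened
# calibration sample per time step. The last two rows are identical, mimicking
# the redundancy the caption reports for later diffusion stages under cache.
toy_features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.1],
    [0.0, 1.0, 0.1],
]

sim = similarity_matrix(toy_features)
for row in sim:
    print(" ".join(f"{value:.3f}" for value in row))
```

In this toy setting, entries near 1 between neighbouring rows indicate calibration samples that add little new information, which is the kind of redundancy the caption says the calibration strategy aims to reduce.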