DiffPro: Joint Timestep and Layer-Wise Precision Optimization for Efficient Diffusion Inference

Farhana Amin, Sabiha Afroz, Kanchon Gharami, Mona Moghadampanah, Dimitrios S. Nikolopoulos

TL;DR

DiffPro tackles the high cost of diffusion inference by jointly optimizing timestep budgets and per-layer precision in Diffusion Transformers under a real hardware budget. It introduces a manifold-aware sensitivity metric, Dynamic Activation Quantization, and drift-guided timestep pruning, all evaluated with hardware-faithful, integer-kernel benchmarks. The approach yields up to 6.25x model compression, around 50% fewer timesteps, and up to 2.8x faster inference while maintaining competitive image fidelity on standard benchmarks. This work enables deployable, energy-aware diffusion inference without retraining, with broad applicability to both DiTs and U-Nets. The key contributions include a post-training joint optimization framework, a data-driven layer-sensitivity-prioritized precision plan, and a practical pruning strategy guided by teacher-student drift.

Abstract

Diffusion models produce high-quality images, but inference is costly due to many denoising steps and heavy matrix operations. We present DiffPro, a post-training, hardware-faithful framework that works with the exact integer kernels used in deployment and jointly tunes timesteps and per-layer precision in Diffusion Transformers (DiTs) to reduce latency and memory without any training. DiffPro combines three parts: a manifold-aware sensitivity metric to allocate weight bits, dynamic activation quantization to stabilize activations across timesteps, and a budgeted timestep selector guided by teacher-student drift. In experiments, DiffPro achieves up to 6.25x model compression, fifty percent fewer timesteps, and 2.8x faster inference with $\Delta\mathrm{FID} \le 10$ on standard benchmarks, demonstrating practical efficiency gains. DiffPro unifies step reduction and precision planning into a single budgeted, deployable plan for real-time, energy-aware diffusion inference.
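To make the idea of dynamic activation quantization concrete, the sketch below shows generic per-tensor symmetric quantization in which the scale is recomputed from the live activation at every call, so activations with different ranges at different timesteps each get their own scale. It is a minimal illustration under stated assumptions, not DiffPro's actual DAQ procedure: the function names `quantize_dynamic` and `dequantize`, the max-abs scale rule, and the 8-bit default are all hypothetical choices made for this example.

```python
def quantize_dynamic(x, bits=8):
    """Symmetric per-tensor quantization with a scale derived from x itself.

    Assumption: the scale is max(|x|) / qmax, recomputed on every call.
    This is a common dynamic-quantization rule, not necessarily the one
    used by DiffPro.
    """
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits
    amax = max(abs(v) for v in x)
    scale = amax / qmax if amax > 0 else 1.0
    # Round to the nearest integer level and clamp to [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(v / scale))) for v in x]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate real values."""
    return [v * scale for v in q]


# Activations at two timesteps with very different ranges.
early = [4.0, -2.0, 0.5]
late = [0.04, -0.02, 0.005]
q_early, s_early = quantize_dynamic(early)
q_late, s_late = quantize_dynamic(late)
```

Because the scale follows the input, both tensors use the full integer range even though their magnitudes differ by a factor of 100. A single static scale calibrated on `early` would map every entry of `late` to level 0 or ±1.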

Paper Structure

This paper contains 25 sections, 8 equations, 22 figures, 3 tables, 2 algorithms.

Figures (22)

  • Figure 1: DiffPro Overview: Joint precision optimization for efficient diffusion inference. The workflow analyzes layers, derives manifold and PCA signals, ranks sensitivity to seed a bit plan, applies DAQ, prunes a 1000-step schedule with a protected tail, then quantizes and evaluates FID, latency, energy, and model size.
  • Figure 2: PCA for the top sensitive layers.
  • Figure 3: Activation distribution over timesteps 50-100 for one chosen layer.
  • Figure 4: Scatter: $\log_{10}\!\bigl(\mathrm{mean}\,\sum x^{2}\bigr)$ vs. $\mathrm{PCA\_sensitivity}$ (size/color $\approx k_{95}/d$).
  • Figure 5: CDF of the combined score.
  • ...and 17 more figures
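
The Figure 1 caption mentions pruning a 1000-step schedule while protecting its tail. One way such a budgeted selection could look is sketched below. Everything specific here is an assumption made for illustration: the function name `prune_schedule`, the interpretation of `drift[i]` as the teacher-student drift incurred when step `i` is skipped, the reading of "tail" as the last positions of the sampling schedule, and the simple top-k ranking. The paper's actual selector may differ.

```python
def prune_schedule(drift, budget, tail=10):
    """Keep `budget` steps out of a schedule with len(drift) steps.

    drift[i] is a hypothetical per-step score: how far the student's
    output departs from the teacher's when step i is skipped.
    The last `tail` positions of the schedule are always kept.
    """
    n = len(drift)
    if not tail <= budget <= n:
        raise ValueError("budget must lie between tail and the schedule length")
    protected = set(range(n - tail, n))
    # Rank unprotected steps by drift, most harmful to skip first.
    candidates = sorted(range(n - tail), key=lambda i: drift[i], reverse=True)
    kept = protected | set(candidates[: budget - tail])
    return sorted(kept)


# Toy schedule of 6 steps, keep 4, protect the last 2.
drift = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3]
kept = prune_schedule(drift, budget=4, tail=2)
```

In this toy run the two protected positions (4 and 5) are kept regardless of their scores, and the remaining budget goes to the two unprotected steps with the highest drift (1 and 3).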