DiffPro: Joint Timestep and Layer-Wise Precision Optimization for Efficient Diffusion Inference
Farhana Amin, Sabiha Afroz, Kanchon Gharami, Mona Moghadampanah, Dimitrios S. Nikolopoulos
TL;DR
DiffPro reduces the cost of diffusion inference by jointly optimizing the timestep budget and per-layer precision of Diffusion Transformers (DiTs) under a real hardware budget. It introduces a manifold-aware sensitivity metric, Dynamic Activation Quantization, and drift-guided timestep pruning, all evaluated with hardware-faithful integer-kernel benchmarks. The approach yields up to 6.25x model compression, around 50% fewer timesteps, and up to 2.8x faster inference while maintaining competitive image fidelity on standard benchmarks. It enables deployable, energy-aware diffusion inference without retraining and applies to both DiTs and U-Nets. The key contributions are a post-training joint optimization framework, a data-driven precision plan that prioritizes layers by sensitivity, and a practical timestep pruning strategy guided by teacher-student drift.
Abstract
Diffusion models produce high-quality images, but inference is costly because of the many denoising steps and heavy matrix operations involved. We present DiffPro, a post-training, hardware-faithful framework that works with the exact integer kernels used in deployment and jointly tunes timesteps and per-layer precision in Diffusion Transformers (DiTs) to reduce latency and memory without any additional training. DiffPro combines three components: a manifold-aware sensitivity metric that allocates weight bits, dynamic activation quantization that stabilizes activations across timesteps, and a budgeted timestep selector guided by teacher-student drift. In experiments, DiffPro achieves up to 6.25x model compression, about 50% fewer timesteps, and up to 2.8x faster inference with Delta FID <= 10 on standard benchmarks. DiffPro thus unifies step reduction and precision planning into a single budgeted, deployable plan for real-time, energy-aware diffusion inference.
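To make the idea of quantizing activations dynamically across timesteps concrete, the following is a minimal sketch and not DiffPro's actual method: it uses generic symmetric per-tensor quantization whose scale is recomputed from each activation tensor at run time, and compares it with a fixed scale calibrated once. The function names, the 8-bit width, and the synthetic "early" and "late" activations are all assumptions made for illustration.

```python
import random

def dynamic_quantize(values, num_bits=8):
    """Symmetric per-tensor quantization; the scale is derived from `values` at call time."""
    qmax = 2 ** (num_bits - 1) - 1                    # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0    # guard against all-zero input
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def static_quantize(values, scale, num_bits=8):
    """Same quantizer, but with a scale fixed ahead of time."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

def max_error(values, q, scale):
    """Largest absolute reconstruction error after dequantization."""
    return max(abs(v - qi * scale) for v, qi in zip(values, q))

random.seed(0)
# Hypothetical activations: wide range at an early timestep, narrow range at a late one.
act_early = [random.gauss(0.0, 4.0) for _ in range(1024)]
act_late = [random.gauss(0.0, 0.5) for _ in range(1024)]

q_early, scale_early = dynamic_quantize(act_early)
q_late, scale_late = dynamic_quantize(act_late)

# Reusing the early-timestep scale on late activations wastes most integer levels.
q_late_static = static_quantize(act_late, scale_early)

err_dynamic = max_error(act_late, q_late, scale_late)
err_static = max_error(act_late, q_late_static, scale_early)
print("late-timestep max error, dynamic scale:", err_dynamic)
print("late-timestep max error, static scale: ", err_static)
```

Under these assumptions, the per-call scale tracks the shrinking activation range, so the late-timestep tensor is represented with a finer step than a single calibrated scale would allow. How DiffPro computes and applies its scales inside integer kernels is not specified in this section.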
