HADIS: Hybrid Adaptive Diffusion Model Serving for Efficient Text-to-Image Generation
Qizheng Yang, Tung-I Chen, Siyu Zhao, Ramesh K. Sitaraman, Hui Guan
TL;DR
HADIS tackles the high computational cost and latency of text-to-image diffusion models with a hybrid serving architecture that combines a lightweight router and a discriminator in a two-stage cascade. Offline profiling builds a Pareto-optimal cascade configuration table, and an online MILP allocator selects configurations, placements, and batch sizes to satisfy latency constraints under varying workloads. The authors provide a theoretical analysis comparing the hybrid architecture with Router-Only and Discriminator-Only baselines, and show empirically that two-model cascades suffice for effective scaling, achieving up to 35% improvement in response quality and substantial reductions in SLO violations on real and synthetic traces. Overall, HADIS enables scalable, latency-aware diffusion-model serving on commodity hardware by combining per-query routing decisions with principled resource management.
Abstract
Text-to-image diffusion models have achieved remarkable visual quality but incur high computational costs, making latency-aware, scalable deployment challenging. To address this, we advocate a hybrid architecture that achieves query awareness when serving diffusion models. Unlike existing query-aware serving systems that cascade lightweight and heavyweight models with a fixed configuration, our hybrid architecture first routes each query directly to a suitable model variant, then reroutes it to a cascaded heavyweight model only if necessary. We theoretically analyze conditions for the hybrid architecture to outperform non-hybrid alternatives in latency and response quality. Building on this architecture, we design HADIS, a hybrid serving system for latency-aware diffusion models that jointly optimizes cascade model selection, query routing, and resource allocation. To reduce the complexity of resource management, HADIS uses an offline profiling phase to produce a Pareto-optimal cascade configuration table. At runtime, HADIS selects the best cascade configuration and GPU allocation given latency and workload constraints. Empirical evaluations on real-world traces demonstrate that HADIS improves response quality by up to 35% while reducing latency violation rates by 2.7$\times$ to 45$\times$ compared to state-of-the-art model serving systems.
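To make the idea of a Pareto-optimal cascade configuration table concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each profiled cascade configuration can be summarized by a mean latency and a quality score. The class `CascadeConfig`, the functions `pareto_front` and `select_config`, and all configuration names and numbers are hypothetical and chosen only for illustration. The runtime step here is a simple lookup under a latency budget; it does not model GPU allocation, placement, or batch sizes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CascadeConfig:
    """One profiled cascade configuration (all fields are illustrative)."""
    name: str          # hypothetical label for the configuration
    latency_s: float   # profiled mean latency per query, lower is better
    quality: float     # profiled response-quality score, higher is better


def pareto_front(configs: List[CascadeConfig]) -> List[CascadeConfig]:
    """Return the configs that no other config beats on both latency and quality."""
    front: List[CascadeConfig] = []
    best_quality = float("-inf")
    # Sweep in order of increasing latency (ties: higher quality first).
    # A config survives only if it is strictly better in quality than
    # every config that is at least as fast.
    for cfg in sorted(configs, key=lambda c: (c.latency_s, -c.quality)):
        if cfg.quality > best_quality:
            front.append(cfg)
            best_quality = cfg.quality
    return front


def select_config(front: List[CascadeConfig],
                  latency_budget_s: float) -> Optional[CascadeConfig]:
    """Pick the highest-quality config whose latency fits the budget, if any."""
    feasible = [c for c in front if c.latency_s <= latency_budget_s]
    return max(feasible, key=lambda c: c.quality) if feasible else None


# Made-up profiling results, for illustration only.
profiled = [
    CascadeConfig("light-only", 1.0, 0.60),
    CascadeConfig("cascade-a", 2.0, 0.75),
    CascadeConfig("cascade-b", 2.5, 0.72),   # slower and worse than cascade-a
    CascadeConfig("heavy-only", 4.0, 0.85),
]

table = pareto_front(profiled)
choice = select_config(table, latency_budget_s=3.0)
print([c.name for c in table])
print(choice.name if choice else "no feasible configuration")
```

In this toy data, `cascade-b` is dropped from the table because `cascade-a` is both faster and higher quality, so the runtime step only ever chooses among configurations that represent a genuine latency-quality trade-off.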
