HADIS: Hybrid Adaptive Diffusion Model Serving for Efficient Text-to-Image Generation

Qizheng Yang, Tung-I Chen, Siyu Zhao, Ramesh K. Sitaraman, Hui Guan

TL;DR

HADIS tackles the high computation and latency costs of text-to-image diffusion models with a hybrid serving architecture that combines a lightweight router and a discriminator in a two-stage cascade. Offline profiling builds a Pareto-optimal cascade configuration table, and an online mixed-integer linear programming (MILP) allocator selects configurations, placements, and batch sizes to satisfy latency constraints under varying workloads. The authors provide a theoretical analysis comparing the hybrid architecture (HY) with Router-Only (RO) and Discriminator-Only (DO) baselines, and demonstrate empirically that two-model cascades suffice for effective scaling, achieving up to 35% improvement in response quality and substantial reductions in service-level objective (SLO) violations on real and synthetic traces. Overall, HADIS enables scalable, latency-aware diffusion-model serving on commodity hardware by combining per-query routing decisions with principled resource management.
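To make the idea of a Pareto-optimal configuration table concrete, the sketch below filters profiled cascade configurations down to those not dominated in both latency and FID, then looks up the best one under a latency budget. It is a minimal illustration, not the HADIS implementation: the `CascadeConfig` fields, the function names, and all numbers are hypothetical, and the simple lookup in `pick_config` only stands in for the online MILP allocator, which also decides placement and batch sizes.

```python
from typing import NamedTuple, Optional


class CascadeConfig(NamedTuple):
    name: str         # hypothetical label for one cascade configuration
    latency_s: float  # profiled latency in seconds (lower is better)
    fid: float        # profiled FID (lower is better)


def pareto_table(configs: list[CascadeConfig]) -> list[CascadeConfig]:
    """Keep only configurations that no other configuration dominates."""
    frontier: list[CascadeConfig] = []
    best_fid = float("inf")
    # Sweep in order of increasing latency (ties broken by FID). A config
    # survives only if it improves on the best FID seen at lower latency.
    for cfg in sorted(configs, key=lambda c: (c.latency_s, c.fid)):
        if cfg.fid < best_fid:
            frontier.append(cfg)
            best_fid = cfg.fid
    return frontier


def pick_config(frontier: list[CascadeConfig],
                latency_budget_s: float) -> Optional[CascadeConfig]:
    """Return the lowest-FID frontier config within the latency budget."""
    feasible = [c for c in frontier if c.latency_s <= latency_budget_s]
    return min(feasible, key=lambda c: c.fid) if feasible else None


# Made-up profiling results, for illustration only.
profiled = [
    CascadeConfig("A", 1.0, 30.0),
    CascadeConfig("B", 2.0, 25.0),
    CascadeConfig("C", 2.5, 27.0),  # dominated by B
    CascadeConfig("D", 4.0, 20.0),
    CascadeConfig("E", 4.0, 22.0),  # dominated by D
]
table = pareto_table(profiled)
chosen = pick_config(table, latency_budget_s=3.0)
```

With these made-up numbers, the table keeps A, B, and D, and a 3-second budget selects B. Building the table offline means the runtime step only has to search a small frontier rather than every possible cascade.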

Abstract

Text-to-image diffusion models have achieved remarkable visual quality but incur high computational costs, making latency-aware, scalable deployment challenging. To address this, we advocate a hybrid architecture that achieves query awareness when serving diffusion models. Unlike existing query-aware serving systems that cascade lightweight and heavyweight models with a fixed configuration, our hybrid architecture first routes each query directly to a suitable model variant, then reroutes it to a cascaded heavyweight model only if necessary. We theoretically analyze conditions for the hybrid architecture to outperform non-hybrid alternatives in latency and response quality. Building on this architecture, we design HADIS, a hybrid serving system for latency-aware diffusion models that jointly optimizes cascade model selection, query routing, and resource allocation. To reduce the complexity of resource management, HADIS uses an offline profiling phase to produce a Pareto-optimal cascade configuration table. At runtime, HADIS selects the best cascade configuration and GPU allocation given latency and workload constraints. Empirical evaluations on real-world traces demonstrate that HADIS improves response quality by up to 35% while reducing latency violation rates by 2.7-45$\times$ compared to state-of-the-art model serving systems.
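The route-then-reroute flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the router returns a difficulty score in [0, 1] that is compared against a threshold `theta`, and that the discriminator returns a quality score in [0, 1] that is compared against a threshold `tau`. The comparison directions, the function `serve`, and the toy stand-in models are all hypothetical.

```python
from typing import Callable, Tuple


def serve(prompt: str,
          router: Callable[[str], float],
          light: Callable[[str], str],
          heavy: Callable[[str], str],
          discriminator: Callable[[str, str], float],
          theta: float,
          tau: float) -> Tuple[str, str]:
    """Return (image, path), where path records which route was taken."""
    # Stage 1 (assumed semantics): a query scored as hard goes straight
    # to the heavyweight model and skips the lightweight attempt.
    if router(prompt) > theta:
        return heavy(prompt), "heavy-direct"
    # Stage 2: try the lightweight model, then check its output.
    image = light(prompt)
    if discriminator(prompt, image) >= tau:
        return image, "light"
    # The lightweight output was judged insufficient, so reroute.
    return heavy(prompt), "heavy-rerouted"


# Toy stand-ins for illustration only; real components would be models.
def toy_router(prompt: str) -> float:
    return min(len(prompt.split()) / 10, 1.0)


def toy_light(prompt: str) -> str:
    return f"light:{prompt}"


def toy_heavy(prompt: str) -> str:
    return f"heavy:{prompt}"


def toy_discriminator(prompt: str, image: str) -> float:
    return 0.9 if "cat" in prompt else 0.2


def run(prompt: str, theta: float = 0.5, tau: float = 0.5) -> Tuple[str, str]:
    return serve(prompt, toy_router, toy_light, toy_heavy,
                 toy_discriminator, theta, tau)
```

In this sketch, setting `theta = 1` means no query is sent directly to the heavyweight model, which behaves like a discriminator-only cascade, while setting `tau = 0` means no query is rerouted, which behaves like router-only serving.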

Paper Structure

This paper contains 19 sections, 10 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: The hybrid architecture for text-to-image diffusion model serving in HADIS. $\theta$: router threshold; $\tau$: discriminator threshold.
  • Figure 2: Quality-latency tradeoffs of different diffusion model variants (left) and model cascades (right). Quality is measured by Fréchet Inception Distance (FID, lower is better) [10.5555/3295222.3295408]. The cascades are constructed from the model variants colored in orange in the left panel. The orange dashed lines partition latency into three regimes, each with a distinct model cascade that yields the frontier points.
  • Figure 3: Limitations of fixed model cascades. In both panels, each point represents a distinct two-model cascade. Comparing between low load (left) and high load (right), the best cascade (circled in red) varies as the system workload and the tolerable SLO violation ratio change.
  • Figure 4: Feasibility region for the discriminator of the HY architecture within which HY outperforms RO. A higher $\alpha$ means a smaller gap between the lightweight and heavyweight models.
  • Figure 5: Feasibility region for the router of the HY architecture within which HY outperforms DO. As $\alpha$ increases, the region becomes larger.
  • ...and 7 more figures