Prompt-Aware Scheduling for Efficient Text-to-Image Inferencing System
Shubham Agarwal, Saud Iqbal, Subrata Mitra
TL;DR
This work tackles the problem of efficiently serving text-to-image diffusion models under high load, where input prompts heavily influence image quality and model-switching overheads hurt latency. It introduces a prompt-aware scheduling system that runs the same diffusion model on all GPUs with ApproxC variants to trade off latency against quality, supported by an optimization that allocates model instances across $K$ values and sets their load fractions to preserve quality. A micro-level prompt-to-$K$ mapping, a $K$-to-$K'$ redirection strategy that minimizes the quality degradation $\mathcal{D}_Q$, and a route-and-batch query dispatcher together enable high throughput with few SLO violations. The architecture includes a Resource Controller with components such as the Model Cache Assigner, Optimal-$K$ Predictor, and K-to-K' Route Planner. The evaluation demonstrates sustained quality above $90\%$ and SLO violations below $5\%$, outperforming baselines by up to $10\%$ in quality and $40\%$ in throughput, with up to $10\times$ reductions in latency violations.
Abstract
Traditional ML models utilize controlled approximations under high load, employing faster but less accurate models in a process called accuracy scaling. However, this method is less effective for generative text-to-image models because of their sensitivity to input prompts and the performance degradation caused by large model-loading overheads. This work introduces a novel text-to-image inference system that optimally matches prompts across multiple instances of the same model operating at various approximation levels to deliver high-quality images under high loads and fixed budgets.
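To make the idea of matching prompts to approximation levels concrete, the following is a minimal sketch and not the authors' algorithm. It assumes each query arrives with a predicted preferred level $K$, that each level has a fixed number of free slots, and that a larger $K$ means a more aggressive approximation. The function names `degradation` and `redirect`, the penalty used in place of $\mathcal{D}_Q$, and the greedy strategy are all hypothetical; the paper formulates this step as an optimization.

```python
def degradation(k_opt, k_assigned):
    """Hypothetical stand-in for D_Q (assumption, not the paper's metric).

    Assumes a larger K is a more aggressive approximation, so only
    assignments above the preferred level are penalized.
    """
    return max(0, k_assigned - k_opt)


def redirect(optimal_ks, capacity):
    """Greedily assign each query to a level K' that still has capacity.

    optimal_ks: preferred level K for each query, in arrival order.
    capacity:   mapping from level K' to the number of free slots.
    Returns the assigned level K' for each query, in the same order.
    """
    remaining = dict(capacity)  # copy so the caller's mapping is untouched
    assignment = []
    for k_opt in optimal_ks:
        candidates = [k for k, free in remaining.items() if free > 0]
        if not candidates:
            raise RuntimeError("no capacity left at any level")
        # Lowest assumed degradation first; among zero-cost levels,
        # prefer the one closest to the preferred level.
        k_new = min(candidates, key=lambda k: (degradation(k_opt, k), -k))
        remaining[k_new] -= 1
        assignment.append(k_new)
    return assignment


if __name__ == "__main__":
    preferred = [5, 5, 10, 10, 10]
    slots = {5: 1, 10: 2, 20: 2}
    assigned = redirect(preferred, slots)
    print(assigned)
    print(sum(degradation(k, a) for k, a in zip(preferred, assigned)))
```

A greedy pass like this handles queries in arrival order and can be suboptimal overall; a solver that considers all queries and levels jointly would be needed to actually minimize the total degradation.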
