Prompt-Aware Scheduling for Efficient Text-to-Image Inferencing System

Shubham Agarwal, Saud Iqbal, Subrata Mitra

TL;DR

This work addresses efficient serving of text-to-image diffusion models under high load, where the input prompt strongly influences image quality and model-switching overheads hurt latency. It introduces a prompt-aware scheduling system that runs the same diffusion model on all GPUs, using ApproxC variants to trade latency against quality. An optimization allocates model instances across $K$ values, together with their load fractions, to preserve quality. A micro-level prompt-to-$K$ mapping, a $K$-to-$K'$ redirection strategy that minimizes the quality degradation $\mathcal{D}_Q$, and a route-and-batch query dispatcher together enable high throughput with few SLO violations. The architecture includes a Resource Controller with components such as the Model Cache Assigner, the Optimal-$K$ Predictor, and the $K$-to-$K'$ Route Planner. The evaluation reports sustained quality above $90\%$ and SLO violations below $5\%$, outperforming baselines by up to $10\%$ in quality and $40\%$ in throughput, with up to $10\times$ reductions in latency violations.
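
To make the $K$-to-$K'$ redirection idea concrete, the following is a minimal sketch and not the paper's algorithm. It assumes that $K$ is an integer approximation level where a larger value means more aggressive approximation, and that each served level has a fixed number of free slots. The `degradation` function is a hypothetical stand-in for $\mathcal{D}_Q$, and the greedy assignment replaces whatever optimization the authors actually use.

```python
def degradation(k_opt: int, k_served: int) -> float:
    """Hypothetical stand-in for D_Q: only extra approximation costs quality."""
    return float(max(0, k_served - k_opt))


def redirect(optimal_ks, capacity):
    """Greedily map each prompt's optimal K to a served K' that has free capacity.

    optimal_ks: predicted optimal K per prompt, in arrival order (assumed input).
    capacity:   dict {K': free slots} for the currently loaded instances (assumed).
    Returns (plan, total_degradation), where plan is a list of (K, K') pairs.
    """
    remaining = dict(capacity)  # copy so the caller's dict is not mutated
    plan, total = [], 0.0
    for k in optimal_ks:
        candidates = [kp for kp, free in remaining.items() if free > 0]
        if not candidates:
            raise RuntimeError("no capacity left for remaining prompts")
        # Lowest degradation first; among ties, prefer the largest K'
        # (the cheapest level that still costs no extra quality).
        best = min(candidates, key=lambda kp: (degradation(k, kp), -kp))
        remaining[best] -= 1
        total += degradation(k, best)
        plan.append((k, best))
    return plan, total


plan, total = redirect([0, 10, 10, 20], {0: 1, 10: 1, 20: 2})
print(plan, total)
```

A greedy pass like this is order-dependent and not guaranteed to minimize total degradation; it only illustrates how a prompt whose preferred level is full can be redirected to the nearest available one.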

Abstract

Traditional ML models utilize controlled approximations during high loads, employing faster, but less accurate models in a process called accuracy scaling. However, this method is less effective for generative text-to-image models due to their sensitivity to input prompts and performance degradation caused by large model loading overheads. This work introduces a novel text-to-image inference system that optimally matches prompts across multiple instances of the same model operating at various approximation levels to deliver high-quality images under high loads and fixed budgets.
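
As a rough illustration of running multiple instances of the same model at different approximation levels under a fixed budget, the sketch below splits a fixed number of GPUs across levels in proportion to an expected load fraction per level. The function name, the load fractions, and the proportional (largest-remainder) rule are all assumptions made for illustration; the paper's actual optimization is not reproduced here.

```python
def allocate_instances(num_gpus: int, load_fraction: dict) -> dict:
    """Split num_gpus across approximation levels in proportion to load.

    load_fraction: dict {K: expected share of queries best served at K}
                   (hypothetical input; shares need not sum exactly to 1).
    Uses the largest-remainder method so the counts sum to num_gpus.
    """
    total = sum(load_fraction.values())
    quotas = {k: num_gpus * f / total for k, f in load_fraction.items()}
    alloc = {k: int(q) for k, q in quotas.items()}  # floor of each quota
    leftover = num_gpus - sum(alloc.values())
    # Hand the remaining GPUs to the levels with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


print(allocate_instances(8, {0: 0.5, 10: 0.3, 20: 0.2}))
```

A proportional split ignores that instances at different levels have different throughput; a fuller treatment would weight each level by its per-instance service rate.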

Paper Structure

This paper contains 4 sections, 1 equation, 3 figures.

Figures (3)

  • Figure 1: Overview of system
  • Figure 2: Performance of system on Twitter trace
  • Figure 3: Aggregated