WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving

Chiheng Lou, Sheng Qi, Rui Kang, Yong Zhang, Chen Sun, Pengcheng Wang, Bingyang Liu, Xuanzhe Liu, Xin Jin

TL;DR

This work tackles the inefficiency of serving multiple LLMs on shared GPU clusters by exploiting long-term workload predictability to prewarm resources. It introduces WarmServe, which builds on universal GPU workers and combines an evict-aware model placement strategy, proactive prewarming during grace periods, and zero-overhead memory switching to enable one-for-many prewarming. WarmServe sharply reduces tail time-to-first-token (TTFT) and sustains higher request throughput than competing approaches: on real-world traces it improves TTFT by up to 50.8$\times$ over the state-of-the-art autoscaling-based system and serves up to 2.5$\times$ more requests than the GPU-sharing system. The results suggest that awareness of future workload characteristics, coupled with careful memory and placement management, can substantially improve multi-LLM serving performance in production environments.

Abstract

Deploying multiple models within shared GPU clusters is promising for improving resource efficiency in large language model (LLM) serving. Existing multi-LLM serving systems optimize GPU utilization at the cost of worse inference performance, especially time-to-first-token (TTFT). We identify the root cause of such compromise as their unawareness of future workload characteristics. In contrast, recent analysis on real-world traces has shown the high periodicity and long-term predictability of LLM serving workloads. We propose universal GPU workers to enable one-for-many GPU prewarming that loads models with knowledge of future workloads. Based on universal GPU workers, we design and build WarmServe, a multi-LLM serving system that (1) mitigates cluster-wide prewarming interference by adopting an evict-aware model placement strategy, (2) prepares universal GPU workers in advance by proactive prewarming, and (3) manages GPU memory with a zero-overhead memory switching mechanism. Evaluation under real-world datasets shows that WarmServe improves TTFT by up to 50.8$\times$ compared to the state-of-the-art autoscaling-based system, while being capable of serving up to 2.5$\times$ more requests compared to the GPU-sharing system.

Paper Structure

This paper contains 22 sections, 6 equations, 17 figures, 1 table, 1 algorithm.

Figures (17)

  • Figure 1: Peak loads under 5-minute windows of the AzureConv trace (DynamoLLM). Data smoothed using cubic spline interpolation.
  • Figure 2: Real and predicted peak loads under 5-minute windows and 8$\times$ speed of the AzureConv trace (DynamoLLM). The delta is shown in absolute values, and the black dotted line represents its average value.
  • Figure 3: Example of cluster-wide prewarming interference.
  • Figure 4: WarmServe system overview.
  • Figure 5: Overview of GPU worker lifecycle in WarmServe.
  • ...and 12 more figures