EconoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving

Haiying Shen, Tanmoy Sen

TL;DR

EconoServe tackles dual-resource optimization in LLM serving: it aims to keep both GPU compute and Key-Value Cache (KVC) memory fully utilized in every iteration while honoring SLOs. It introduces three core mechanisms: KVCPipe for sharing allocated but unused KVC space, SyncDecoupled with time-synced batching to decouple prompt processing from generation, and an Ordering strategy guided by SLO, KVC occupancy, and response length (RL) or prompt length. It also relies on RL prediction to group generation tasks (GTs) by predicted RL. Trace-based experiments show up to 4× higher throughput, up to 91% lower job completion time, and up to 91% higher SLO satisfaction ratio than vLLM, along with up to 78% fewer GPUs than DistServe at the same goodput. The work offers a practical path to cost-efficient, scalable LLM inference by maximizing resource utilization without requiring high-bandwidth interconnects or multiple model replicas.

Abstract

As Large Language Models (LLMs) continue to grow, reducing costs and alleviating GPU demands has become increasingly critical. However, existing schedulers primarily target either GPU compute or Key-Value Cache (KVC) utilization. They fail to fully utilize both GPU compute and KVC in each iteration, or to guarantee timely KVC allocation when it is needed. To address these challenges, we conducted a trace-based experimental analysis whose observations led to the design of a system called EconoServe. EconoServe maximizes multi-resource utilization while ensuring service-level objective (SLO) guarantees in LLM serving. To allow prompts to be added to a batch so that GPU utilization is maximized in each iteration, EconoServe maintains separate waiting queues for prompt processing tasks (PTs) and generation tasks (GTs). It batches GTs with the same predicted response length (RL) to save scheduling time, and it allocates KVC space for the predicted RL to avoid KVC allocation failures. It also introduces a novel KVC pipelining method that shares allocated but unused KVC space to improve KVC utilization. In addition, it prioritizes queued requests that occupy more KVC, both to release KVC earlier and to satisfy request SLOs. Experimental results demonstrate that, compared to vLLM, EconoServe increases throughput by up to 4× at the same level of latency, and achieves up to 91% lower job completion time and up to 91% higher SLO satisfaction ratio. It also reduces the number of GPUs used in DistServe by up to 78% while maintaining the same level of goodput.
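
The queueing ideas in the abstract can be illustrated with a small sketch. This is not the EconoServe implementation: the class `GenerationTask`, its fields, and the three helper functions are hypothetical names chosen for illustration, and the sketch assumes that groups are formed on exact equality of the predicted RL and that priority depends only on KVC occupancy (the full Ordering strategy also considers SLO and length).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class GenerationTask:
    # All field names are illustrative assumptions, not taken from the paper.
    request_id: str
    predicted_rl: int      # predicted response length, in tokens
    kvc_tokens_held: int   # KVC already occupied by this request, in tokens


def group_by_predicted_rl(tasks):
    """Bucket GTs so that every task in a group shares one predicted RL."""
    groups = defaultdict(list)
    for task in tasks:
        groups[task.predicted_rl].append(task)
    return dict(groups)


def order_by_kvc_held(tasks):
    """Place tasks that occupy more KVC first so their space is freed earlier."""
    return sorted(tasks, key=lambda t: t.kvc_tokens_held, reverse=True)


def kvc_to_reserve(task):
    """Reserve KVC for the whole predicted response up front (one slot per token)."""
    return task.predicted_rl


# A toy generation-task queue.
queue = [
    GenerationTask("a", predicted_rl=128, kvc_tokens_held=300),
    GenerationTask("b", predicted_rl=64, kvc_tokens_held=900),
    GenerationTask("c", predicted_rl=128, kvc_tokens_held=50),
]

groups = group_by_predicted_rl(queue)   # tasks "a" and "c" share a group
ordered = order_by_kvc_held(queue)      # "b" holds the most KVC, so it goes first
```

Tasks in one group are expected to finish in the same iteration, which is why a same-RL batch can be scheduled once rather than re-evaluated at every step. Reserving `predicted_rl` tokens in advance is what leaves room for a pipelining scheme to lend out the part of that reservation that is not yet in use.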

Paper Structure

This paper contains 17 sections, 15 figures, 2 tables, 1 algorithm.

Figures (15)

  • Figure 1: Comparison of different schedulers.
  • Figure 2: Num. of requests in a same-RL group in experiment.
  • Figure 3: Different combinations.
  • Figure 4: Impact of adding padding to the predicted response length (RL).
  • Figure 5: Impact of RL misprediction.
  • ...and 10 more figures