Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving

Chengyi Nie, Rodrigo Fonseca, Zhenhua Liu

TL;DR

Aladdin targets cost-efficient LLM serving under strict SLOs with a co-adaptive scheduler that jointly places inference requests and scales cluster resources. It introduces performance models for the batched prefill and decode phases, and uses them to predict the minimal GPU resources and the matching worker configurations before placing requests with a multi-dimensional bin packing approach. The system supports both continuous batching and split-phase inference, adapts to changing demand through heartbeat-based reconfiguration, and reports up to 71% cost savings over the baselines, along with improvements in SLO attainment and average token generation time (ATGT). The techniques are aimed at cloud providers that want to run expensive LLM workloads more efficiently while preserving user experience.

Abstract

The demand for large language model (LLM) inference is gradually dominating artificial intelligence workloads. Therefore, there is an urgent need for cost-efficient inference serving. Existing work focuses on single-worker optimization and lacks consideration of cluster-level management for both inference queries and computing resources. However, placing requests and managing resources without considering the query features easily causes SLO violations or resource underutilization. Providers are forced to allocate extra computing resources to guarantee user experience, leading to additional serving costs. In this paper, we introduce Aladdin, a scheduler that co-adaptively places queries and scales computing resources with SLO awareness. For a stream of inference queries, Aladdin first predicts the minimal computing resources and the corresponding serving workers' configuration required to fulfill the SLOs for all queries. Then, it places the queries on each serving worker according to the prefill and decode latency models of batched LLM inference to maximize each worker's utilization. Results show that Aladdin reduces the serving cost of a single model by up to 71% for the same SLO level compared with the baselines, which can amount to millions of dollars per year.
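
The placement step described in the abstract can be pictured as bin packing under latency and memory constraints. The following is a minimal sketch, not the paper's algorithm: it assumes linear prefill and decode latency models, uses a plain first-fit-decreasing heuristic, and treats all requests on a worker as one prefill batch. Every constant, class, and function name here (`Request`, `Worker`, `fits`, `place`, the SLO and capacity values) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    input_tokens: int             # prompt length
    predicted_output_tokens: int  # assumed to come from some output-length predictor


@dataclass
class Worker:
    requests: list = field(default_factory=list)

    def kv_tokens(self) -> int:
        # Peak KV-cache footprint if every request reaches its predicted length.
        return sum(r.input_tokens + r.predicted_output_tokens for r in self.requests)

    def prefill_tokens(self) -> int:
        return sum(r.input_tokens for r in self.requests)


# Hypothetical constants; real values would be profiled per model and GPU.
KV_CAPACITY_TOKENS = 8000       # KV-cache capacity of one worker
PREFILL_BASE_MS = 20.0          # assumed linear prefill model: base + slope * tokens
PREFILL_MS_PER_TOKEN = 0.1
DECODE_BASE_MS = 20.0           # assumed linear decode model: base + slope * KV tokens
DECODE_MS_PER_KV_TOKEN = 0.005
TTFT_SLO_MS = 500.0             # time-to-first-token SLO
ATGT_SLO_MS = 50.0              # average-token-generation-time SLO


def fits(worker: Worker, req: Request) -> bool:
    """Check memory, TTFT, and ATGT constraints if req joined this worker."""
    kv = worker.kv_tokens() + req.input_tokens + req.predicted_output_tokens
    prefill = worker.prefill_tokens() + req.input_tokens
    ttft = PREFILL_BASE_MS + PREFILL_MS_PER_TOKEN * prefill
    atgt = DECODE_BASE_MS + DECODE_MS_PER_KV_TOKEN * kv
    return kv <= KV_CAPACITY_TOKENS and ttft <= TTFT_SLO_MS and atgt <= ATGT_SLO_MS


def place(requests: list) -> list:
    """First-fit-decreasing placement; opens a new worker when none fits."""
    workers = []
    by_size = sorted(
        requests,
        key=lambda r: r.input_tokens + r.predicted_output_tokens,
        reverse=True,
    )
    for req in by_size:
        for worker in workers:
            if fits(worker, req):
                worker.requests.append(req)
                break
        else:
            fresh = Worker()
            if not fits(fresh, req):
                raise ValueError("request violates constraints even on an empty worker")
            fresh.requests.append(req)
            workers.append(fresh)
    return workers
```

In this sketch, the number of workers returned by `place` plays the role of the resource estimate: it is the count a greedy heuristic needs to keep every request within the assumed SLOs, which is an upper bound on the true minimum rather than a guaranteed optimum.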

Paper Structure (26 sections, 10 equations, 14 figures, 3 tables)

Figures (14)

  • Figure 1: The overall architecture of co-adaptive scheduling.
  • Figure 2: CDF of output length for different prompt lengths from ShareGPT and llama2-13b-chat-hf generated output.
  • Figure 3: An example illustrating the suboptimality of JSQ (join the shortest queue) for request placement.
  • Figure 4: Workflow of Aladdin with default continuous batching.
  • Figure 5: Workflow of Aladdin with split-phase inference.
  • ...and 9 more figures