SLOs-Serve: Optimized Serving of Multi-SLO LLMs

Siyuan Chen, Zhipeng Jia, Samira Khan, Arvind Krishnamurthy, Phillip B. Gibbons

TL;DR

The paper tackles the challenge of serving multi-stage LLM requests under fine-grained, application-specific SLOs. It introduces SLOs-Serve, a DP-based scheduler that optimizes token allocations across prefill and decode stages, leveraging chunked prefill, adaptive speculative decoding, and soft admission control to guarantee SLO attainment for admitted requests. A holistic system design combines burst-resilient scheduling with soft admission and multi-replica request routing, backed by a Roofline-inspired performance model and dynamic batch-size tuning. Empirical evaluation across six application scenarios shows substantial capacity improvements over state-of-the-art baselines (average ~2.2x), with robust burst handling and near-linear scaling in multi-replica settings, underscoring the practical impact for heterogeneous LLM workloads.

Abstract

This paper introduces SLOs-Serve, a system designed for serving multi-stage large language model (LLM) requests with application- and stage-specific service level objectives (SLOs). The key idea behind SLOs-Serve is to customize the allocation of tokens to meet these SLO requirements. SLOs-Serve uses a multi-SLO dynamic programming-based algorithm to continuously optimize token allocations under SLO constraints by exploring the full design space of chunked prefill and (optional) speculative decoding. Leveraging this resource planning algorithm, SLOs-Serve effectively supports multi-SLOs and multi-replica serving with dynamic request routing while being resilient to bursty arrivals. Our evaluation across 6 LLM application scenarios (including summarization, coding, chatbot, tool calling, and reasoning) demonstrates that SLOs-Serve improves per-GPU serving capacity by 2.2x on average compared to prior state-of-the-art systems.
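As a rough illustration of planning under SLO constraints, the sketch below uses a small dynamic program to count how many requests can meet a time-to-first-token deadline on one replica. It is a simplified stand-in, not the paper's algorithm: it assumes all requests are already queued at time 0, a fixed prefill throughput, and no decode or speculative-decoding work. The names `Request`, `max_admitted`, and `prefill_tokens_per_s` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Request:
    # Hypothetical request description; field names are illustrative only.
    name: str
    prompt_tokens: int
    ttft_deadline_s: float  # time-to-first-token SLO, measured from time 0


def max_admitted(requests: List[Request], prefill_tokens_per_s: float) -> int:
    """Largest number of requests whose prefill can finish by its deadline.

    Requests are visited in earliest-deadline order. best[k] holds the
    smallest total prefill time needed to serve some k of the requests
    seen so far on time, or infinity if no such subset exists.
    """
    ordered = sorted(requests, key=lambda r: r.ttft_deadline_s)
    inf = float("inf")
    best = [0.0] + [inf] * len(ordered)
    for req in ordered:
        cost = req.prompt_tokens / prefill_tokens_per_s
        # Walk k downward so each request is added to a subset at most once.
        for k in range(len(ordered) - 1, -1, -1):
            finish = best[k] + cost
            if finish <= req.ttft_deadline_s and finish < best[k + 1]:
                best[k + 1] = finish
    return max(k for k, t in enumerate(best) if t < inf)


# Toy workload with made-up numbers.
demo = [
    Request("chat", 2000, 0.5),
    Request("code", 4000, 1.0),
    Request("summarize", 8000, 1.0),
]
print(max_admitted(demo, prefill_tokens_per_s=8000.0))
```

Requests that do not fit into the best subset are the ones an admission-control policy would have to defer or reject. Modeling decode-stage SLOs, chunked prefill, or speculative decoding would need a richer state than this sketch carries.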

Paper Structure

This paper contains 45 sections, 13 equations, 15 figures, 5 tables, 3 algorithms.

Figures (15)

  • Figure 1: Serving capacity comparison for LLM applications with heterogeneous SLOs, on a server with 4 A100s (experimental details in Tables \ref{tab:scenario} and \ref{tab:datasets} of §\ref{sec:eval}). N/S: not supported.
  • Figure 2: Throughput-latency trade-off for batching. Each data point (green for the OPT-7B model [zhang2022opt] on A100, red for the OPT-13B model [zhang2022opt] on H100) is a batch executed under SLOs-Serve's scheduling with both prefill and decode tokens.
  • Figure 3: Comparison between different co-located scheduling approaches.
  • Figure 4: Capacity in DistServe with different prefill (PF), decode (DCD) device ratios when serving the OPT-13B model with H100 GPUs. The capacity is normalized to 1 PF:1 DCD.
  • Figure 5: Illustration of the scheduling algorithms.
  • ...and 10 more figures
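
The throughput-latency trade-off for batching (Figure 2) can be approximated with a generic roofline-style estimate: a batch takes as long as the slower of its compute time and its weight-loading time. The sketch below is an assumption-laden illustration, not the performance model used in the paper. It ignores KV-cache traffic, attention cost, and kernel overheads, and the hardware numbers are round placeholder values rather than measured figures.

```python
def batch_latency_s(batch_tokens: int, params: float, peak_flops: float,
                    mem_bandwidth_bytes_s: float,
                    bytes_per_param: int = 2) -> float:
    """Roofline-style latency estimate for one forward pass over a batch."""
    # Roughly 2 FLOPs per parameter per token for the dense layers.
    compute_s = 2.0 * params * batch_tokens / peak_flops
    # Weights are read from device memory once per batch.
    memory_s = params * bytes_per_param / mem_bandwidth_bytes_s
    return max(compute_s, memory_s)


def throughput_tokens_s(batch_tokens: int, **hw: float) -> float:
    """Tokens processed per second at the given batch size."""
    return batch_tokens / batch_latency_s(batch_tokens, **hw)


# Placeholder hardware and model sizes (assumptions for illustration only).
hw = dict(params=13e9, peak_flops=300e12, mem_bandwidth_bytes_s=2e12)

for tokens in (16, 64, 256, 1024, 2048):
    latency_ms = batch_latency_s(tokens, **hw) * 1e3
    print(f"{tokens:5d} tokens  {latency_ms:7.2f} ms  "
          f"{throughput_tokens_s(tokens, **hw):10.0f} tok/s")
```

Under these placeholder numbers, small batches are memory-bound, so adding tokens raises throughput at no latency cost. Past the crossover, latency grows linearly with batch size and throughput flattens, which is the tension a scheduler has to balance against per-request latency SLOs.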