SLOs-Serve: Optimized Serving of Multi-SLO LLMs

Siyuan Chen; Zhipeng Jia; Samira Khan; Arvind Krishnamurthy; Phillip B. Gibbons

TL;DR

The paper addresses serving multi-stage LLM requests under fine-grained, application-specific SLOs. It introduces SLOs-Serve, a scheduler based on dynamic programming (DP) that optimizes token allocation across the prefill and decode stages using chunked prefill and adaptive speculative decoding, with soft admission control so that admitted requests attain their SLOs. The system design adds burst-resilient scheduling and multi-replica request routing, supported by a Roofline-inspired performance model and dynamic batch-size tuning. Evaluation across six application scenarios shows capacity improvements over state-of-the-art baselines of about 2.2x on average, with robust burst handling and near-linear scaling in multi-replica settings.
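To make the "Roofline-inspired performance model" concrete, here is a minimal sketch of the general idea: a batch's latency is bounded below by a memory-bound floor and grows linearly with token count once it becomes compute-bound. The function names, parameters, and coefficients (`ms_per_token`, `memory_floor_ms`, `overhead_ms`) are hypothetical and chosen for illustration; the paper's actual model may be parameterized differently.

```python
import math


def batch_time_ms(num_tokens, ms_per_token, memory_floor_ms, overhead_ms=0.0):
    """Roofline-style estimate: the slower of compute and memory dominates."""
    compute_ms = num_tokens * ms_per_token  # compute-bound regime, linear in tokens
    return overhead_ms + max(compute_ms, memory_floor_ms)


def max_tokens_within(budget_ms, ms_per_token, memory_floor_ms, overhead_ms=0.0):
    """Largest batch (in tokens) whose predicted time fits the latency budget."""
    usable_ms = budget_ms - overhead_ms
    if usable_ms < memory_floor_ms:
        return 0  # even a minimal batch would exceed the budget
    return math.floor(usable_ms / ms_per_token)


# Hypothetical coefficients: 0.2 ms per token, 20 ms memory-bound floor.
small = batch_time_ms(10, ms_per_token=0.2, memory_floor_ms=20.0)
large = batch_time_ms(200, ms_per_token=0.2, memory_floor_ms=20.0)
cap = max_tokens_within(50.1, ms_per_token=0.2, memory_floor_ms=20.0)
```

Under these assumed coefficients, small batches cost the same as the memory floor, which is why a scheduler can fill them with extra prefill or speculative tokens at little added latency. Inverting the model gives the per-batch token budget for a given latency target.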

Abstract

This paper introduces SLOs-Serve, a system designed for serving multi-stage large language model (LLM) requests with application- and stage-specific service level objectives (SLOs). The key idea behind SLOs-Serve is to customize the allocation of tokens to meet these SLO requirements. SLOs-Serve uses a multi-SLO dynamic programming-based algorithm to continuously optimize token allocations under SLO constraints by exploring the full design space of chunked prefill and (optional) speculative decoding. Leveraging this resource planning algorithm, SLOs-Serve effectively supports multi-SLOs and multi-replica serving with dynamic request routing while being resilient to bursty arrivals. Our evaluation across 6 LLM application scenarios (including summarization, coding, chatbot, tool calling, and reasoning) demonstrates that SLOs-Serve improves per-GPU serving capacity by 2.2x on average compared to prior state-of-the-art systems.
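As a rough illustration of SLO-aware token allocation with soft admission, the sketch below checks whether a set of prefill requests can all meet their deadlines under a fixed per-step token budget, and defers any request that would break feasibility. It is a deliberately simplified stand-in, not the paper's multi-SLO dynamic programming algorithm: it uses an earliest-deadline-first feasibility test, ignores decode-stage SLOs and speculative decoding, and assumes all requests are present at step 0. The `Request` fields, the step-based deadline unit, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    prefill_tokens: int
    deadline_steps: int  # hypothetical unit: time-to-first-token SLO in scheduler steps


def feasible(requests, tokens_per_step):
    """EDF test: cumulative demand by each deadline must fit the token budget.

    Chunked prefill is what allows one request's tokens to span several steps.
    """
    used = 0
    for r in sorted(requests, key=lambda r: r.deadline_steps):
        used += r.prefill_tokens
        if used > tokens_per_step * r.deadline_steps:
            return False
    return True


def admit(arrivals, tokens_per_step):
    """Admit a request only if every admitted request still meets its SLO."""
    admitted, deferred = [], []
    for r in arrivals:
        if feasible(admitted + [r], tokens_per_step):
            admitted.append(r)
        else:
            deferred.append(r)  # soft admission: deferred, not dropped
    return admitted, deferred


arrivals = [
    Request("chat", 300, 1),
    Request("summarize", 1500, 4),
    Request("code", 400, 1),
    Request("tool", 200, 2),
]
admitted, deferred = admit(arrivals, tokens_per_step=512)
```

With a budget of 512 tokens per step, `chat`, `summarize`, and `tool` are admitted, while `code` is deferred because it and `chat` together would need 700 tokens in the first step.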
