SLOs-Serve: Optimized Serving of Multi-SLO LLMs
Siyuan Chen, Zhipeng Jia, Samira Khan, Arvind Krishnamurthy, Phillip B. Gibbons
TL;DR
The paper addresses serving multi-stage LLM requests under fine-grained, application-specific SLOs. It introduces SLOs-Serve, a scheduler based on dynamic programming (DP) that optimizes token allocations across the prefill and decode stages, using chunked prefill, adaptive speculative decoding, and soft admission control to guarantee SLO attainment for admitted requests. The system combines burst-resilient scheduling with soft admission and multi-replica request routing, backed by a Roofline-inspired performance model and dynamic batch-size tuning. Evaluation across six application scenarios shows an average capacity improvement of about 2.2x over state-of-the-art baselines, with robust burst handling and near-linear scaling in multi-replica settings.
Abstract
This paper introduces SLOs-Serve, a system designed for serving multi-stage large language model (LLM) requests with application- and stage-specific service level objectives (SLOs). The key idea behind SLOs-Serve is to customize the allocation of tokens to meet these SLO requirements. SLOs-Serve uses a multi-SLO dynamic programming-based algorithm to continuously optimize token allocations under SLO constraints by exploring the full design space of chunked prefill and (optional) speculative decoding. Leveraging this resource planning algorithm, SLOs-Serve effectively supports multi-SLOs and multi-replica serving with dynamic request routing while being resilient to bursty arrivals. Our evaluation across 6 LLM application scenarios (including summarization, coding, chatbot, tool calling, and reasoning) demonstrates that SLOs-Serve improves per-GPU serving capacity by 2.2x on average compared to prior state-of-the-art systems.
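To make the idea of planning token allocations under SLO constraints concrete, the following is a minimal toy sketch and not the paper's algorithm. It assumes all requests are already queued, a fixed per-iteration token budget, a fixed iteration time, and a fixed number of tokens reserved for ongoing decodes; it ignores speculative decoding and per-token decode SLOs. The names `Request`, `max_admitted`, `tokens_per_iter`, `iter_time_ms`, and `decode_slots` are hypothetical. Under those assumptions, a small dynamic program over requests sorted by deadline finds the largest set whose chunked prefills can all finish before their time-to-first-token (TTFT) deadlines.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Request:
    name: str
    prefill_tokens: int    # prompt tokens that must be prefilled
    ttft_deadline_ms: int  # TTFT SLO, measured from now (hypothetical field)


def max_admitted(requests, tokens_per_iter, iter_time_ms, decode_slots=0):
    """Largest number of requests whose prefill can meet its TTFT deadline.

    Toy model (assumption): each iteration processes `tokens_per_iter` tokens,
    of which `decode_slots` are reserved for ongoing decodes; the rest can be
    spent on prefill chunks of any request.
    """
    prefill_budget = tokens_per_iter - decode_slots
    if prefill_budget <= 0:
        return 0

    # Serving an admitted set in deadline order is optimal when all
    # requests are available at time zero, so scan in that order.
    ordered = sorted(requests, key=lambda r: r.ttft_deadline_ms)

    # best[k] = fewest prefill tokens needed to admit k requests seen so far.
    best = [0] + [math.inf] * len(ordered)
    for r in ordered:
        # Prefill tokens that fit into the iterations completed by the deadline.
        capacity = (r.ttft_deadline_ms // iter_time_ms) * prefill_budget
        # Iterate k downward so each request is admitted at most once.
        for k in range(len(ordered) - 1, -1, -1):
            total = best[k] + r.prefill_tokens
            if total <= capacity:
                best[k + 1] = min(best[k + 1], total)

    return max(k for k, tokens in enumerate(best) if tokens < math.inf)


if __name__ == "__main__":
    queue = [
        Request("summarize", 1000, 100),
        Request("code", 800, 150),
        Request("chat", 400, 200),
    ]
    # 512 tokens per 50 ms iteration, 12 of them reserved for decodes.
    print(max_admitted(queue, tokens_per_iter=512, iter_time_ms=50, decode_slots=12))  # → 2
```

In this toy setting, the three prefills need 2200 tokens in total, but only 2000 prefill tokens fit before the latest deadline, so at most two requests can be admitted. A system in the spirit of the abstract would defer or reroute the third request rather than let it violate its SLO.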
