Revisiting Service Level Objectives and System Level Metrics in Large Language Model Serving
Zhibin Wang, Shipeng Li, Yuhang Zhou, Xue Li, Zhonghui Zhang, Nguyen Cam-Tu, Rong Gu, Chen Tian, Guihai Chen, Sheng Zhong
TL;DR
This work addresses the mismatch between throughput-focused optimization and user experience in online LLM serving by introducing a token-level SLO framework with token deadlines $d_i$ and a unified metric called smooth goodput. It shows that existing metrics can mislead optimization (e.g., via output-delay tricks or abandoning missed requests) and that a per-token deadline aligned with user processing speed, $d_i = V \times i$, better captures user experience. The authors formalize SLO attainment and goodput, augment them with user idle latency $l_r$ and a benefit function, and demonstrate through experiments on LLaMA-3.1-8B and Qwen-2.5-14B that smooth goodput reveals optimal operating points and reduces detrimental migrations/preemptions in disaggregated architectures. The study provides practical guidance for designing LLM serving systems that balance throughput and user experience across workloads, and outlines future directions for semantic-aware and adaptive SLOs.
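As a rough illustration of the per-token deadline $d_i = V \times i$ mentioned above, the following minimal sketch computes the deadlines and flags tokens delivered late. It is not the authors' implementation. The function names `token_deadlines` and `late_tokens` are hypothetical, and it assumes that $V$ is the time a user needs per token, in seconds, and that delivery times are measured from the same origin as the deadlines.

```python
from typing import List


def token_deadlines(num_tokens: int, v: float) -> List[float]:
    """Deadline of the i-th token (1-indexed), d_i = V * i.

    Assumption: v is the user's per-token processing time in seconds.
    """
    return [v * i for i in range(1, num_tokens + 1)]


def late_tokens(delivery_times: List[float], v: float) -> List[int]:
    """Return the 1-indexed positions of tokens delivered after their deadline.

    Assumption: delivery_times[i - 1] is when token i reached the user,
    measured from the same time origin as the deadlines.
    """
    deadlines = token_deadlines(len(delivery_times), v)
    return [
        i
        for i, (t, d) in enumerate(zip(delivery_times, deadlines), start=1)
        if t > d
    ]


if __name__ == "__main__":
    # Three tokens with V = 0.5 s per token: deadlines are 0.5, 1.0, 1.5.
    print(token_deadlines(3, 0.5))
    # Only the second token (delivered at 1.2 s) misses its 1.0 s deadline.
    print(late_tokens([0.4, 1.2, 1.4], 0.5))
```

Under these assumptions, a token that arrives early does not stall the user, so only tokens that arrive after $d_i$ count against the request.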
Abstract
User experience is a critical factor that Large Language Model (LLM) serving systems must consider. Two kinds of performance measures are central: service level objectives (SLOs), which capture the experience of individual requests, and system level metrics (SLMs), which capture overall system performance. However, we observe two notable issues in existing metrics: 1) manually delaying the delivery of some tokens can improve SLOs, and 2) actively abandoning requests that do not meet SLOs can improve SLMs. Both effects are counterintuitive. In this paper, we revisit SLOs and SLMs in LLM serving and propose a new SLO that aligns with user experience. Building on this SLO, we propose a comprehensive metric framework called smooth goodput, which integrates SLOs and SLMs to reflect the nature of user experience in LLM serving. Using this unified framework, we reassess the performance of different LLM serving systems under multiple workloads. Evaluation results show that our metric framework provides a more comprehensive view of token delivery and request processing, and effectively captures the optimal point of user experience and system performance under different serving strategies.
