Revisiting Service Level Objectives and System Level Metrics in Large Language Model Serving

Zhibin Wang, Shipeng Li, Yuhang Zhou, Xue Li, Zhonghui Zhang, Nguyen Cam-Tu, Rong Gu, Chen Tian, Guihai Chen, Sheng Zhong

TL;DR

This work addresses the mismatch between throughput-focused optimization and user experience in online LLM serving by introducing a token-level SLO framework with token deadlines $d_i$ and a unified metric called smooth goodput. It shows that existing metrics can mislead optimization (e.g., via output-delay tricks or abandoning missed requests) and that a per-token deadline aligned with user processing speed, $d_i = V \times i$, better captures user experience. The authors formalize SLO attainment and goodput, augment them with user idle latency $l_r$ and a benefit function, and demonstrate through experiments on LLaMA-3.1-8B and Qwen-2.5-14B that smooth goodput reveals optimal operating points and reduces detrimental migrations/preemptions in disaggregated architectures. The study provides practical guidance for designing LLM serving systems that balance throughput and user experience across workloads, and outlines future directions for semantic-aware and adaptive SLOs.
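The contrast between a gap-based metric and the per-token deadline $d_i = V \times i$ can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 0.5 s TBT threshold, the sample timestamps, and the choice of measuring all times from a common origin at $t = 0$ are assumptions made for illustration only. Here $V$ is read as the time a user needs to consume one token.

```python
from typing import List


def token_deadlines(num_tokens: int, v: float) -> List[float]:
    """Deadline of the i-th token (1-indexed), d_i = V * i.

    Assumption: times are measured from a common origin t = 0.
    """
    return [v * i for i in range(1, num_tokens + 1)]


def meets_deadlines(delivery_times: List[float], v: float) -> bool:
    """True if every token is delivered no later than its deadline."""
    deadlines = token_deadlines(len(delivery_times), v)
    return all(t <= d for t, d in zip(delivery_times, deadlines))


def max_tbt(delivery_times: List[float]) -> float:
    """Largest gap between consecutive token deliveries (time between tokens)."""
    return max(b - a for a, b in zip(delivery_times, delivery_times[1:]))


V = 0.25          # hypothetical user consumption time per token, in seconds
TBT_SLO = 0.5     # hypothetical TBT threshold, in seconds

# Three tokens arrive quickly, then the fourth stalls.
original = [0.125, 0.25, 0.375, 1.5]
# Same final token time, but earlier tokens are held back to even out the gaps.
delayed = [0.375, 0.75, 1.125, 1.5]

for name, times in [("original", original), ("delayed", delayed)]:
    print(
        name,
        "TBT ok:", max_tbt(times) <= TBT_SLO,
        "deadlines ok:", meets_deadlines(times, V),
    )
```

In this sketch, holding tokens back turns a TBT violation into a pass even though no token reaches the user sooner, whereas the deadline check fails for both timelines. Under a deadline-style check, delaying a token can only move it closer to or past its deadline, so the output-delay trick offers no gain.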

Abstract

User experience is a critical factor that Large Language Model (LLM) serving systems must consider. Two key performance measures capture it: service level objectives (SLOs), which consider the experience of individual requests, and system level metrics (SLMs), which consider overall system performance. However, we observe two notable issues in existing metrics: 1) manually delaying the delivery of some tokens can improve SLOs, and 2) actively abandoning requests that do not meet SLOs can improve SLMs, both of which are counterintuitive. In this paper, we revisit SLOs and SLMs in LLM serving, and propose a new SLO that aligns with user experience. Based on this SLO, we propose a comprehensive metric framework called smooth goodput, which integrates SLOs and SLMs to reflect the nature of user experience in LLM serving. Through this unified framework, we reassess the performance of different LLM serving systems under multiple workloads. Evaluation results show that our metric framework provides a more comprehensive view of token delivery and request processing, and effectively captures the optimal point of user experience and system performance with different serving strategies.

Paper Structure

This paper contains 22 sections, 6 equations, 11 figures.

Figures (11)

  • Figure 1: Examples illustrating the limitations of TBT and TPOT.
  • Figure 2: Token generation timeline in LLM serving systems and its impact on user experience. The red area indicates affected user experience, while the green area indicates good user experience. The blue line represents the token generation timeline and each dot represents one token. The total timeline can be divided into three parts: ① the time to receive the first token (TTFT), ② the time when the user is waiting for the next token, and ③ the time when the user is consuming the delivered information.
  • Figure 3: Existing SLOs of LLM Serving. Note that this figure ignores the difference between token generation from the LLM and its delivery to users.
  • Figure 4: An illustration of iteration scheduling strategies.
  • Figure 5: Performance of vLLM under different QPS, where CP denotes chunked prefills adoption.
  • ...and 6 more figures