SageSched: Efficient LLM Scheduling Confronting Demand Uncertainty and Hybridity

Zhenghao Gan; Yichen Bao; Yifei Liu; Chen Chen; Quan Chen; Minyi Guo

TL;DR

This work proposes SageSched, an efficient LLM scheduler that properly handles demand uncertainty and hybridity of inference workloads, and employs an uncertainty-aware scheduling policy that can yield the best overall efficiency given the request cost distributions.

Abstract

Efficient LLM inference scheduling is crucial for user experience. However, LLM inference exhibits remarkable demand uncertainty (the output length is unknown beforehand) and hybridity (requests are both compute- and memory-intensive). Existing LLM schedulers rely on simple heuristics or focus purely on compute resources, and therefore suffer from suboptimal performance. In this work, we propose SageSched, an efficient LLM scheduler that properly handles the demand uncertainty and hybridity of inference workloads. SageSched combines prompt contents with past inference results to predict the output-length distribution in a lightweight yet accurate manner. Meanwhile, it models the true service cost of an inference request, taking both compute and memory aspects into account. Finally, SageSched employs an uncertainty-aware scheduling policy that can yield the best overall efficiency given the request cost distributions. Testbed experiments over diverse setups confirm that SageSched can attain an efficiency improvement of over 28.7%.
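To make the idea of a history-based output-length distribution concrete, the following is a minimal sketch, not the paper's actual predictor. It assumes that each past request is stored as a (prompt embedding, observed output length) pair, that similarity is measured by cosine similarity, and that the output lengths of the `k` most similar past prompts serve as the empirical distribution. The function names, the toy two-dimensional embeddings, and the choice of `k` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_length_distribution(prompt_vec, history, k=3):
    """Return the sorted output lengths of the k past prompts most similar to prompt_vec.

    history is a list of (embedding, output_length) pairs; this layout is an
    assumption made for illustration only.
    """
    ranked = sorted(history, key=lambda item: cosine(prompt_vec, item[0]), reverse=True)
    return sorted(length for _, length in ranked[:k])


def mean_length(samples):
    """Mean of a non-empty list of sampled output lengths."""
    return sum(samples) / len(samples)


# Toy history: two "short answer" prompts and two "long answer" prompts.
history = [
    ([1.0, 0.0], 100),
    ([0.9, 0.1], 120),
    ([0.0, 1.0], 900),
    ([0.1, 0.9], 1000),
]

samples = predict_length_distribution([1.0, 0.05], history, k=2)
print(samples, mean_length(samples))
```

In this toy setup the new prompt lies close to the two short-answer prompts, so the estimated distribution is drawn only from their lengths. A scheduler could then work with the whole sample set rather than a single point estimate.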

Paper Structure (19 sections, 13 figures)

Introduction
Background and Motivation
The LLM Scheduling Problem
Limitations of Existing LLM Schedulers
SageSched Design
Semantic-aware History-based Predictor
Resource-bound-based Cost Modeling
Uncertainty-aware Request Scheduling
Evaluation
Setup
End-to-end Experiment
Microscopic Deep Dive
Superiority of our Predictor Design
Superiority of our Cost Modeling Method
Superiority of our Scheduling Policy
...and 4 more sections

Figures (13)

Figure 1: Empirical evidence of the uncertainty and hybridity characteristics of LLM inferences' resource demands.
Figure 2: Examples illustrating the deficiencies of existing schedulers when confronting demand uncertainty and hybridity.
Figure 3: Overview of SageSched workflow.
Figure 4: Output-length distribution of a request can be better approximated by historical requests with a higher prompt similarity.
Figure 5: Measurements of the instantaneous resource bound, as well as the compute cost characteristics in decoding. All measurements are conducted on an H800 GPU with the Qwen3-32B model.
...and 8 more figures
