OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration

Youhe Jiang; Fangcheng Fu; Taiyi Wang; Guoliang He; Eiko Yoneki

TL;DR

OServe addresses the dual challenges of spatial and temporal heterogeneity in LLM serving by co-optimizing workload assignment and heterogeneous model deployment through a two-level flow-network framework, complemented by fine-grained workload prediction and ad hoc model switching to adapt to changing demand. The lower level solves a max-flow problem to assign workloads to heterogeneous replicas, while the upper level progressively refines deployment to maximize throughput, guided by flow feedback. A dedicated workload predictor performs per-type arrival-rate forecasting for short horizons, enabling proactive strategy changes with minimal switching overhead via a greedy interconnect-aware parameter transfer plan. Empirical results on real traces with models up to 70B parameters show OServe delivering up to twofold improvements in end-to-end latency and throughput (average ~1.5×) over state-of-the-art baselines, and demonstrate scalability, robustness to prediction errors, and practical applicability in dynamic, heterogeneous GPU clusters.
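The lower-level assignment step can be illustrated with a small max-flow instance. The sketch below is not OServe's implementation: the network layout (source → workload types → replicas → sink), the replica names `A` and `B`, and all capacities are hypothetical values chosen for illustration, and a single per-replica sink capacity is a simplification of how heterogeneous requests consume resources. It uses the standard Edmonds-Karp algorithm and assumes the network has no pair of opposing edges.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp max-flow. capacity maps u -> {v: int capacity}.
    Returns (flow value, {(u, v): flow on that original edge})."""
    # Residual graph: forward edges carry capacity, reverse edges start at 0.
    residual = {}
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break
        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck
    flow = {}
    for u, edges in capacity.items():
        for v, c in edges.items():
            used = c - residual[u][v]
            if used > 0:
                flow[(u, v)] = used
    return total, flow

# Hypothetical instance: arrival rates (requests/s) on source edges,
# per-type serving limits on type -> replica edges, and a total
# serving limit per replica on replica -> sink edges.
capacity = {
    "src": {"short": 6, "long": 4},
    "short": {"A": 5, "B": 3},
    "long": {"A": 1, "B": 3},
    "A": {"sink": 5},
    "B": {"sink": 4},
}
total, flow = max_flow(capacity, "src", "sink")
for (u, v), used in sorted(flow.items()):
    # Printed in the "used | capacity" notation of the Figure 4 caption.
    print(f"{u} -> {v}: {used} | {capacity[u][v]}")
print("served requests/s:", total)
```

In this toy instance both replica-to-sink edges end up saturated, which is the kind of flow feedback an upper level could use to decide where more capacity is needed.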

Abstract

Serving Large Language Models (LLMs) can benefit immensely from parallelizing both the model and input requests across multiple devices, but incoming workloads exhibit substantial spatial and temporal heterogeneity. Spatially, workloads comprise heterogeneous requests with varying compute and memory demands. Temporally, workload composition varies over time. Nevertheless, existing systems typically assume spatially uniform and temporally stable workloads, employing a homogeneous, static model deployment. This mismatch between the assumption and real-world spatial-temporal heterogeneity results in suboptimal performance. We present OServe, an LLM serving system with heterogeneous and flexible model deployment that addresses both spatial and temporal heterogeneity. First, OServe introduces a novel workload-aware scheduling algorithm that optimizes heterogeneous model deployments according to real-time workload characteristics. Second, OServe proposes an efficient workload-adaptive switching method that migrates model deployments in response to predicted workload changes. Experiments on real-world traces show that OServe improves performance by up to 2$\times$ (average: 1.5$\times$) compared to state-of-the-art serving systems.
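The idea of switching deployments in response to predicted workload changes can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's predictor: an exponentially weighted moving average (EWMA) stands in for the workload predictor, and the function names, the smoothing factor `alpha`, the `threshold`, and the trace values are all hypothetical.

```python
def ewma_forecast(history, alpha=0.5):
    """One-step-ahead EWMA forecast of the arrival rate for each workload type.
    history maps type -> list of observed rates, oldest first (non-empty)."""
    forecast = {}
    for wtype, rates in history.items():
        estimate = rates[0]
        for rate in rates[1:]:
            estimate = alpha * rate + (1 - alpha) * estimate
        forecast[wtype] = estimate
    return forecast

def composition(rates):
    """Share of total arrivals contributed by each workload type."""
    total = sum(rates.values())
    if total == 0:
        return {wtype: 0.0 for wtype in rates}
    return {wtype: rate / total for wtype, rate in rates.items()}

def should_switch(deployed_for, predicted, threshold=0.2):
    """Return True when any type's predicted share differs from the share
    the current deployment was planned for by more than threshold."""
    current = composition(deployed_for)
    future = composition(predicted)
    return any(abs(future[w] - current[w]) > threshold for w in current)

# Hypothetical trace: short requests fade while long requests grow.
history = {"short": [8, 8, 4, 2], "long": [2, 2, 6, 8]}
deployed_for = {"short": 8, "long": 2}  # rates the current deployment assumed

predicted = ewma_forecast(history)
print(predicted)
print(should_switch(deployed_for, predicted))
```

A threshold on composition shift is only one possible trigger; a real system would also weigh the cost of migrating model parameters against the expected gain.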

Paper Structure (27 sections, 19 figures, 1 table, 2 algorithms)

This paper contains 27 sections, 19 figures, 1 table, 2 algorithms.

Introduction
Background
Workload-aware Scheduling
Scheduling Problem Statement
Lower-level Workload Assignment
Upper-Level Model Deployment
Workload-adaptive Switching
Workload Prediction
Ad Hoc Model Switching
Experimental Evaluation
Experimental Setup
End-to-end Performance
Case and Ablation Studies
Algorithm Efficiency
Conclusion
...and 12 more sections

Figures (19)

Figure 1: Performance comparisons of different parallelism strategies across resource allocations and workload types. The two workload types are subsampled from real-world traces in the Azure Public Dataset [patel2024splitwise].
Figure 2: Temporal evolution of workload composition and arrival rates derived from real-world traces in the Azure Public Dataset [patel2024splitwise].
Figure 3: Example of model deployment and workload assignment.
Figure 4: Illustration of the flow network. $a \mid b$ denotes that $a$ is the used capacity out of $b$.
Figure 5: Example of flow network guided generation.
...and 14 more figures
