OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration
Youhe Jiang, Fangcheng Fu, Taiyi Wang, Guoliang He, Eiko Yoneki
TL;DR
OServe addresses spatial and temporal heterogeneity in LLM serving by co-optimizing workload assignment and heterogeneous model deployment in a two-level flow-network framework, complemented by fine-grained workload prediction and ad hoc model switching to adapt to changing demand. The lower level solves a max-flow problem to assign workloads to heterogeneous replicas; the upper level, guided by flow feedback, progressively refines the deployment to maximize throughput. A dedicated workload predictor forecasts per-type arrival rates over short horizons, enabling proactive strategy changes, and a greedy, interconnect-aware parameter transfer plan keeps switching overhead low. On real traces with models of up to 70B parameters, OServe improves end-to-end latency and throughput by up to 2× (about 1.5× on average) over state-of-the-art baselines, and the evaluation also demonstrates scalability, robustness to prediction errors, and practical applicability in dynamic, heterogeneous GPU clusters.
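To make the lower-level idea concrete, the following is a minimal sketch of workload assignment posed as a max-flow problem. It is not OServe's actual formulation: the network layout (source, workload types, replicas, sink), the names `short`, `long`, `A`, `B`, and all capacities are hypothetical, and each replica is assumed to have a fixed capacity in requests per second regardless of the request mix. The solver is a plain Edmonds-Karp implementation.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp: augment along shortest residual paths until none remain."""
    # residual[u][v] holds the remaining capacity on edge u -> v.
    residual = defaultdict(lambda: defaultdict(int))
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual[u][v] += c
            residual[v][u] += 0  # make sure the reverse edge exists
    total = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break  # no augmenting path left: the flow is maximal
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck
    return total, residual


# Hypothetical instance (all numbers are illustrative assumptions, in req/s):
# two workload types and two heterogeneous replicas; long requests are
# assumed to fit only on replica B.
capacity = {
    "src": {"short": 8, "long": 6},      # arrival rate per workload type
    "short": {"A": 8, "B": 8},           # short requests may go to either replica
    "long": {"B": 6},                    # long requests may go only to B
    "A": {"sink": 5},                    # serving capacity of replica A
    "B": {"sink": 7},                    # serving capacity of replica B
}

total, residual = max_flow(capacity, "src", "sink")

# Flow on each original edge = original capacity minus remaining capacity.
assignment = {
    (u, v): c - residual[u][v]
    for u, edges in capacity.items()
    for v, c in edges.items()
}
```

In this toy instance the total served rate is 12 req/s against 14 req/s of demand, so both replicas are saturated. Per-replica saturation of this kind is the sort of flow feedback an upper level could use when deciding how to change the deployment, though the exact feedback signal is an assumption here.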
Abstract
Serving Large Language Models (LLMs) can benefit immensely from parallelizing both the model and input requests across multiple devices, but incoming workloads exhibit substantial spatial and temporal heterogeneity. Spatially, workloads comprise heterogeneous requests with varying compute and memory demands. Temporally, workload composition varies over time. Nevertheless, existing systems typically assume spatially uniform and temporally stable workloads, employing a homogeneous, static model deployment. This mismatch between the assumption and real-world spatial-temporal heterogeneity results in suboptimal performance. We present OServe, an LLM serving system with heterogeneous and flexible model deployment that addresses both spatial and temporal heterogeneity. First, OServe introduces a novel workload-aware scheduling algorithm that optimizes heterogeneous model deployments according to real-time workload characteristics. Second, OServe proposes an efficient workload-adaptive switching method that migrates model deployments in response to predicted workload changes. Experiments on real-world traces show that OServe improves performance by up to 2× (average: 1.5×) compared to state-of-the-art serving systems.
