SageServe: Optimizing LLM Serving on Cloud Data Centers with Forecast Aware Auto-Scaling
Shashwat Jaiswal, Kunal Jain, Yogesh Simmhan, Anjaly Parayil, Ankur Mallick, Rujia Wang, Renee St. Amant, Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, Saravan Rajmohan
TL;DR
The paper analyzes production LLM inference traces from Microsoft O365 to reveal latency-tiered IW/NIW patterns and inter-region imbalances. It then introduces SageServe, a holistic framework that unifies a shared GPU pool, region-aware routing, and multi-timescale autoscaling guided by ARIMA forecasts and an ILP optimizer. By combining proactive forecast-driven provisioning with memory-based reactive adjustments and a NIW queue, SageServe achieves substantial GPU utilization improvements and cost savings (up to 25% GPU-hours saved and around $2.5M/month) while preserving SLA integrity. Evaluations using a SplitWise-based simulator and real-world traces across three regions and four open-source models demonstrate burst resilience, MoE scalability, and favorable comparisons to state-of-the-art baselines like Chiron. The work provides open traces and artifacts to enable community benchmarking and practical deployment insights for cloud-scale LLM serving.
Abstract
Global cloud service providers handle inference workloads for Large Language Models (LLMs) that span latency-sensitive (e.g., chatbots) and insensitive (e.g., report writing) tasks, resulting in diverse and often conflicting Service Level Agreement (SLA) requirements. Managing such mixed workloads is challenging due to the complexity of the inference serving stack, which encompasses multiple models, GPU hardware, and global data centers. Existing solutions often silo such fast and slow tasks onto separate GPU resource pools with different SLAs, but this leads to significant under-utilization of expensive accelerators due to load mismatch. In this article, we characterize the LLM serving workloads at Microsoft Office 365, one of the largest users of LLMs within Microsoft Azure cloud with over 10 million requests per day, and highlight key observations across workloads in different data center regions and across time. This is one of the first such public studies of Internet-scale LLM workloads. We use these insights to propose SageServe, a comprehensive LLM serving framework that dynamically adapts to workload demands using multi-timescale control knobs. It combines short-term request routing to data centers with long-term scaling of GPU VMs and model placement with higher lead times, and co-optimizes the routing and resource allocation problem using a traffic forecast model and an Integer Linear Programming (ILP) solution. We evaluate SageServe through real runs and realistic simulations on 10 million production requests across three regions and four open-source models. We achieve up to 25% savings in GPU-hours compared to the current baseline deployment and reduce GPU-hour wastage due to inefficient auto-scaling by 80%, resulting in a potential monthly cost savings of up to $2.5 million, while maintaining tail latency and meeting SLAs.
