Hierarchical Prediction-based Management for LMaaS Systems
Zhihan Jiang, Yujie Huang, Guangba Yu, Junjie Huang, Jiazhen Gu, Michael R. Lyu
TL;DR
This paper tackles LMaaS management by introducing PreServe, a hierarchical prediction-based framework that combines long-term service workload forecasts with short-term per-request load predictions to proactively scale LLM instances and route requests. It pairs a service workload predictor (based on mLSTM) with a request load predictor (DistilBERT with prompt tuning), and uses their outputs to drive a per-instance load anticipator, a proactive scaler, and a load-aware router that balance load and reduce tail latency. Evaluations on real-world Azure LMaaS traces and ShareGPT data show that PreServe reduces tail latency and resource usage while maintaining low overhead, outperforming state-of-the-art baselines on metrics including P99 latency and SLO violations. The work offers practical mechanisms for robust LMaaS management, combining predictive accuracy with operational efficiency and open-source tooling for broader adoption.
Abstract
Large Language Models (LLMs) have revolutionized numerous domains, driving the rise of Language-Model-as-a-Service (LMaaS) platforms that process millions of queries daily. These platforms must minimize latency and meet Service Level Objectives (SLOs) while optimizing resource usage. However, conventional cloud service management techniques, designed for traditional workloads, are suboptimal for LMaaS because its service workloads are dynamic and its request loads are highly variable. To address this, we propose PreServe, a tailored LMaaS management framework centered on hierarchical prediction. PreServe incorporates a service workload predictor that estimates periodic token density at a coarse granularity and a novel request load predictor that assesses the resource demand of individual LLM requests, enabling the construction of a load anticipator for each LLM instance. By integrating both long-term and short-term predictions, PreServe adjusts resource allocation in advance, mitigating the risk of instance under- or over-provisioning. In addition, PreServe optimizes request routing by considering both current and anticipated future instance loads, ensuring balanced load distribution across instances. Evaluations on real-world production datasets show that PreServe outperforms state-of-the-art methods, reducing tail latency by 41.3% and cutting resource consumption by 49.38% while incurring only 0.23% additional overhead.
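To make the routing idea concrete, the following is a minimal sketch of load-aware routing that considers both current and anticipated instance load. It is an illustration under stated assumptions, not PreServe's implementation: the `Instance` class, the `route` function, the additive load score, and the token-based load unit are all hypothetical, and the predicted request load is passed in as a plain number rather than produced by a real predictor.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical view of one LLM instance (not PreServe's data model)."""
    name: str
    current_load: float            # load already being served (assumed unit: tokens)
    anticipated_load: float = 0.0  # predicted future load from routed requests


def route(instances, predicted_request_load):
    """Send a request to the instance with the lowest combined load.

    The score (current + anticipated) is an assumed, simplified stand-in
    for a load anticipator; the chosen instance's anticipated load is
    then increased by the request's predicted load.
    """
    target = min(instances, key=lambda i: i.current_load + i.anticipated_load)
    target.anticipated_load += predicted_request_load
    return target


instances = [
    Instance("a", current_load=100.0),
    Instance("b", current_load=40.0, anticipated_load=80.0),
    Instance("c", current_load=90.0),
]
first = route(instances, predicted_request_load=50.0)
second = route(instances, predicted_request_load=50.0)
print(first.name, second.name)
```

In this example, instance "b" has the lowest current load but is passed over because its anticipated load is high, which is the behavior a router that looks only at current load would miss.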
