Queue Management for SLO-Oriented Large Language Model Serving
Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew Kalbarczyk, Ravishankar Iyer
TL;DR
This paper presents QLM, a queue-management framework for SLO-oriented LLM serving that jointly orchestrates batch and interactive workloads across heterogeneous models and hardware. Central to QLM is a Request Waiting Time (RWT) estimator, which, together with request groups and virtual queues, enables a global LP-based scheduler to assign and reorder work and to drive four LLM Serving Operations (LSOs): request pulling, request eviction, model swapping, and load balancing. Compared to other state-of-the-art LLM serving systems, the approach improves SLO attainment by 40–90% and throughput by 20–400% on real-world, multi-model workloads and heterogeneous GPUs, while maintaining or improving device utilization. The work demonstrates practical viability by integrating with vLLM, evaluating on multiple LLMs (e.g., Mistral-7B, Vicuna-13B, Llama-70B) across A10/A100 clusters, and using production-like traces (ShareGPT). These results underscore the value of SLO-aware queue management and LSO orchestration for scalable, efficient LLM serving in cloud environments.
Abstract
Large language model (LLM) serving is becoming an increasingly critical workload for cloud providers. Existing LLM serving systems focus on interactive requests, such as chatbots and coding assistants, with tight latency SLO requirements. However, when such systems execute batch requests with relaxed SLOs alongside interactive requests, the result is poor multiplexing and inefficient resource utilization. To address these challenges, we propose QLM, a queue management system for LLM serving. QLM maintains batch and interactive requests across different models and SLOs in a request queue. Optimal ordering of the request queue is critical to maintaining SLOs while ensuring high resource utilization. To generate this optimal ordering, QLM uses a Request Waiting Time (RWT) Estimator that estimates the waiting times for requests in the request queue. A global scheduler uses these estimates to orchestrate LLM Serving Operations (LSOs) such as request pulling, request eviction, load balancing, and model swapping. Evaluation on heterogeneous GPU devices and models with a real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared to other state-of-the-art LLM serving systems. QLM's evaluation is based on the production requirements of a cloud provider. QLM is publicly available at https://www.github.com/QLM-project/QLM.
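
To make the role of waiting-time estimates concrete, the following is a minimal sketch and not QLM's implementation. It assumes a single device that drains the queue in order at a constant token throughput, treats an SLO as a bound on how long a request group may wait before it starts running, and replaces the global scheduler with a simple earliest-deadline-first sort. The names `RequestGroup`, `estimate_waiting_times`, `order_by_deadline`, and `slo_violations`, as well as all numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestGroup:
    """Hypothetical stand-in for a group of requests sharing a model and SLO."""
    name: str
    pending_output_tokens: int  # assumed total tokens still to be generated
    slo_seconds: float          # assumed bound on waiting time, measured from now


def estimate_waiting_times(queue, tokens_per_second):
    """Return {group name: estimated seconds before the group starts running}.

    Assumption: the device serves groups in queue order at constant throughput,
    so a group waits for all tokens queued ahead of it.
    """
    waits = {}
    tokens_ahead = 0
    for group in queue:
        waits[group.name] = tokens_ahead / tokens_per_second
        tokens_ahead += group.pending_output_tokens
    return waits


def order_by_deadline(queue):
    """Earliest-deadline-first ordering, a simple heuristic used for illustration."""
    return sorted(queue, key=lambda g: g.slo_seconds)


def slo_violations(queue, tokens_per_second):
    """Names of groups whose estimated waiting time exceeds their SLO."""
    waits = estimate_waiting_times(queue, tokens_per_second)
    return [g.name for g in queue if waits[g.name] > g.slo_seconds]


# A batch job arrives ahead of an interactive request (illustrative values).
queue = [
    RequestGroup("batch-summaries", pending_output_tokens=60_000, slo_seconds=3600.0),
    RequestGroup("chat", pending_output_tokens=2_000, slo_seconds=20.0),
]
throughput = 1_000.0  # tokens per second, an assumed figure

print(slo_violations(queue, throughput))                     # arrival order
print(slo_violations(order_by_deadline(queue), throughput))  # reordered
```

In arrival order the interactive group waits behind the whole batch job and misses its bound; after reordering, both groups fit within their bounds. This is the kind of decision that waiting-time estimates make possible, here under far simpler assumptions than a multi-model, multi-device deployment.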
