Queue Management for SLO-Oriented Large Language Model Serving

Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew Kalbarczyk, Ravishankar Iyer

TL;DR

This paper presents QLM, a queue-management framework for SLO-oriented LLM serving that jointly orchestrates batch and interactive workloads across heterogeneous models and hardware. Central to QLM is a Request Waiting Time (RWT) estimator, which, together with request groups and virtual queues, enables a global LP-based scheduler to assign and reorder work and drive four LLM Serving Operations (pulling, eviction, swapping, load balancing). The approach achieves substantial improvements in SLO attainment (40–90%) and throughput (20–400%) on real-world, multi-model workloads and heterogeneous GPUs, while maintaining high device utilization. The work demonstrates practical viability by integrating with vLLM, evaluating on multiple LLMs (e.g., Mistral-7B, Vicuna-13B, Llama-70B) across A10/A100 clusters, and using production-like traces (ShareGPT). These results underscore the value of SLO-aware queue management and LSO orchestration for scalable, efficient LLM serving in cloud environments.

Abstract

Large language model (LLM) serving is becoming an increasingly critical workload for cloud providers. Existing LLM serving systems focus on interactive requests, such as chatbots and coding assistants, with tight latency SLO requirements. However, when such systems execute batch requests with relaxed SLOs alongside interactive requests, the result is poor multiplexing and inefficient resource utilization. To address these challenges, we propose QLM, a queue management system for LLM serving. QLM maintains batch and interactive requests across different models and SLOs in a request queue. Optimal ordering of the request queue is critical to maintaining SLOs while ensuring high resource utilization. To generate this optimal ordering, QLM uses a Request Waiting Time (RWT) Estimator that estimates the waiting times for requests in the request queue. These estimates are used by a global scheduler to orchestrate LLM Serving Operations (LSOs) such as request pulling, request eviction, load balancing, and model swapping. Evaluation on heterogeneous GPU devices and models with a real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared to other state-of-the-art LLM serving systems. QLM's evaluation is based on the production requirements of a cloud provider. QLM is publicly available at https://www.github.com/QLM-project/QLM.
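
To make the idea of waiting-time-driven queue ordering concrete, here is a minimal sketch. It is not QLM's estimator or its LP-based scheduler: the `Request` fields, the serial "tokens ahead divided by aggregate throughput" approximation, and the sort-by-SLO reordering are all assumptions chosen for illustration, and the numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical fields, not taken from the QLM codebase.
    name: str
    expected_output_tokens: int  # assumed to be known or predicted
    slo_seconds: float           # waiting-time SLO for this request


def estimate_waiting_times(queue, tokens_per_second):
    """Estimate each request's wait as (tokens queued ahead) / throughput.

    This is an assumed approximation of a waiting-time estimate, treating
    the serving instance as draining tokens at a constant aggregate rate.
    """
    waits = {}
    tokens_ahead = 0
    for req in queue:
        waits[req.name] = tokens_ahead / tokens_per_second
        tokens_ahead += req.expected_output_tokens
    return waits


def slo_violations(queue, tokens_per_second):
    """Return names of requests whose estimated wait exceeds their SLO."""
    waits = estimate_waiting_times(queue, tokens_per_second)
    return [r.name for r in queue if waits[r.name] > r.slo_seconds]


# Two relaxed-SLO batch requests arrive before one interactive request.
queue = [
    Request("batch-1", 4000, 3600.0),
    Request("batch-2", 4000, 3600.0),
    Request("chat-1", 200, 20.0),
]
THROUGHPUT = 200.0  # assumed aggregate tokens per second

fifo_violations = slo_violations(queue, THROUGHPUT)

# A simple tightest-SLO-first reordering stands in for the global scheduler.
reordered = sorted(queue, key=lambda r: r.slo_seconds)
reordered_violations = slo_violations(reordered, THROUGHPUT)

print("FIFO violations:", fifo_violations)
print("Reordered violations:", reordered_violations)
```

Under these assumed numbers, first-in-first-out order leaves the interactive request waiting behind 8000 queued tokens, while moving it ahead of the batch requests keeps every request within its SLO.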

Paper Structure (23 sections, 22 equations, 20 figures, 4 tables, 1 algorithm)

Figures (20)

  • Figure 1: Previously proposed SLO-oriented serving systems overestimate queue waiting time leading to suboptimal resource usage. (Left) Estimated waiting time when requests are run with Llama-70B on A100 GPUs with vLLM. (Right) Number of GPUs required to maintain 20s time-to-first-token (TTFT) SLO with previous systems vs QLM in single and multi-model scenarios.
  • Figure 2: QLM uses request groups and LLM Serving Operations (LSOs) such as request eviction to minimize resource requirement. Previously proposed systems would use four vLLM instances (compared to two for QLM) due to limitations described in Figure 1.
  • Figure 3: Requests have predictable waiting times in a continuous batching system.
  • Figure 4: Forced request eviction leads to reduction in head-of-line (HOL) blocking time.
  • Figure 5: Model swapping and request pulling can jointly decrease queue drain time.
  • ...and 15 more figures

Theorems & Definitions (5)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3
  • Definition 4.1
  • Definition 4.2