Efficient LLM Scheduling by Learning to Rank

Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang

TL;DR

The paper tackles head-of-line blocking in LLM serving by ranking requests by relative generation length instead of predicting exact generation lengths. It introduces a lightweight OPT-based ranking predictor, trained with ListMLE and evaluated with Kendall's Tau, to approximate SJF/SRTF scheduling. A rank-based scheduler with starvation prevention is deployed atop vLLM, yielding up to 2.8x lower p90 latency in chatbot serving and 6.5x higher throughput in synthetic data generation. The approach is simple to integrate and delivers end-to-end gains on real workloads with modest predictor overhead.

Abstract

In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple First-come-first-serve (FCFS) scheduling strategy, leading to Head-Of-Line (HOL) blocking and reduced throughput and service quality. In this paper, we reexamine this assumption -- we show that, although predicting the exact generation length of each request is infeasible, it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank. The ranking information offers valuable guidance for scheduling requests. Building on this insight, we develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches. We integrate this scheduler with the state-of-the-art LLM serving system and show significant performance improvement in several important applications: 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation. Our code is available at https://github.com/hao-ai-lab/vllm-ltr.git
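The ranking idea can be illustrated with a minimal sketch. It is not the authors' implementation: the scores in `pred_scores` are hypothetical stand-ins for the output of a ranking predictor, the lengths are made up for illustration, and `kendall_tau` is a plain Tau-a variant written here for clarity (the paper may use a different variant). The sketch measures how well the predicted order agrees with the true output lengths, then serves requests in ascending order of predicted score to approximate SJF.

```python
from itertools import combinations


def kendall_tau(pred_scores, true_lengths):
    """Kendall's Tau-a: (concordant - discordant) / total pairs; ties count as neither."""
    n = len(pred_scores)
    if n < 2:
        raise ValueError("need at least two requests to compare rankings")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: the pair is ordered the same way in both lists.
        sign = (pred_scores[i] - pred_scores[j]) * (true_lengths[i] - true_lengths[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) // 2)


# Hypothetical batch: true output lengths (tokens) and predictor scores
# (a lower score means the request is predicted to finish sooner).
true_lengths = [120, 15, 640, 48]
pred_scores = [0.4, 0.1, 0.9, 0.5]

tau = kendall_tau(pred_scores, true_lengths)

# Rank-based schedule: serve requests in ascending order of predicted score.
schedule = sorted(range(len(pred_scores)), key=lambda i: pred_scores[i])

print(f"Kendall's Tau = {tau:.3f}")
print("Serving order (request indices):", schedule)
```

Only the relative order of the scores matters here: the predictor misorders one pair (requests 0 and 3), yet the shortest request still runs first and the longest runs last.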

Paper Structure (23 sections, 3 equations, 7 figures, 9 tables, 1 algorithm)

This paper contains 23 sections, 3 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: A long request can block short requests, causing severe HOL blocking and high latency. We assume there is no prefill time and that the system takes 1 second to generate 1 token. With a First-come-first-serve (FCFS) schedule, the long request R0, which arrives first and takes 10 seconds to generate 10 tokens, blocks the subsequent shorter requests R1 and R2 for 10 seconds. The latencies perceived by users for R0, R1, and R2 are therefore $10 / 10 = 1$, $(10 + 2) / 2 = 6$, and $(10+2+1)/1 = 13 \hbox{ s / token}$, respectively, with an average latency of $(1+6+13)/3 = 6.67 \hbox{ s / token}$. By contrast, prioritizing the shortest requests (R2, then R1, then R0) gives latencies of $(1+2+10)/10 = 1.3$, $(1+2)/2 = 1.5$, and $1/1 = 1 \hbox{ s / token}$, with an average latency of $(1.3+1.5+1)/3=1.27 \hbox{ s / token}$ -- a $5.3\times$ reduction in average latency.
  • Figure 2: (a) HOL blocking of 1K requests on the ShareGPT dataset. (b) Higher Kendall's Tau yields lower latency. Evaluated on the ShareGPT dataset with the Llama-3-8B model.
  • Figure 3: Mean latency of different schedulers with Llama-3 models on real workloads.
  • Figure 4: Average max_waiting_time across all requests with different scheduling methods
  • Figure 5: Influence of starvation prevention on latency
  • ...and 2 more figures
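
The arithmetic in the Figure 1 caption can be reproduced with a short sketch. It uses the caption's assumptions (no prefill time, 1 second per generated token, one request running at a time); the function name `per_token_latencies` is introduced here for illustration and is not taken from the paper's code.

```python
def per_token_latencies(lengths, order, sec_per_token=1.0):
    """Per-request latency in s/token when requests run one at a time in `order`."""
    finish_time = 0.0
    latencies = {}
    for idx in order:
        finish_time += lengths[idx] * sec_per_token  # the request completes here
        latencies[idx] = finish_time / lengths[idx]  # normalize by its output length
    return latencies


lengths = [10, 2, 1]  # output tokens for R0, R1, R2, as in the caption

# FCFS: serve in arrival order R0, R1, R2.
fcfs = per_token_latencies(lengths, order=[0, 1, 2])

# Shortest first: serve in ascending order of output length.
sjf_order = sorted(range(len(lengths)), key=lambda i: lengths[i])
sjf = per_token_latencies(lengths, order=sjf_order)

fcfs_mean = sum(fcfs.values()) / len(fcfs)
sjf_mean = sum(sjf.values()) / len(sjf)

print(f"FCFS mean latency: {fcfs_mean:.2f} s/token")
print(f"Shortest-first mean latency: {sjf_mean:.2f} s/token")
print(f"Reduction: {fcfs_mean / sjf_mean:.1f}x")
```

The long request R0 pays only a small penalty under shortest-first scheduling (1.3 instead of 1 s/token), while R1 and R2 improve from 6 and 13 s/token to 1.5 and 1 s/token.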