Prompt-Aware Scheduling for Low-Latency LLM Serving

Yiheng Tao, Yihe Zhang, Matthew T. Dearing, Xin Wang, Yuping Fan, Zhiling Lan

TL;DR

PARS addresses latency variability in LLM inference by approximating shortest-job-first scheduling with a pairwise learning-to-rank predictor. It uses BERT-based representations and a margin ranking loss to focus on informative prompt pairs, and it incorporates reasoning traces when estimating response length. The online scheduler ranks waiting tasks and prioritizes those predicted to produce shorter responses, with starvation control to ensure fairness, and is integrated into vLLM with minimal overhead. Experiments across multiple LLMs and datasets show consistent reductions in average and tail latency and strong cross-model generalization. An open-source release is planned.
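
To make the margin ranking loss mentioned above concrete, here is a minimal sketch of the standard pairwise formulation, max(0, -y * (s_a - s_b) + margin). The function name, the convention that a higher score means a longer predicted response, and the margin value of 1.0 are illustrative assumptions, not details taken from the paper.

```python
def margin_ranking_loss(score_a: float, score_b: float, y: int, margin: float = 1.0) -> float:
    """Standard pairwise margin ranking loss for one pair of prompts.

    y = +1 means prompt A should score higher than prompt B
    (assumed here: A has the longer response); y = -1 means the opposite.
    The margin value is a hypothetical choice for illustration.
    """
    return max(0.0, -y * (score_a - score_b) + margin)


# Correct order, separated by more than the margin: no penalty.
well_separated = margin_ranking_loss(2.0, 0.5, y=1)

# Wrong order: penalty grows with the size of the mistake.
misordered = margin_ranking_loss(0.5, 2.0, y=1)

# Correct order but inside the margin: a small penalty remains.
too_close = margin_ranking_loss(1.2, 1.0, y=1)
```

Because the loss is zero once a pair is ordered correctly with enough separation, training effort concentrates on pairs that are misordered or nearly tied.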

Abstract

Efficient scheduling of LLM inference tasks is essential for achieving low latency and high throughput, particularly with the growing use of reasoning-capable LLMs. Traditional strategies like First-Come-First-Serve (FCFS) often suffer from Head-of-Line (HOL) blocking, where long-running tasks delay shorter ones queued behind them. In this paper, we introduce PARS, a prompt-aware LLM task scheduler that improves serving efficiency by approximating shortest-job-first (SJF) scheduling through pairwise ranking with margin ranking loss. PARS focuses on impactful scheduling decisions and is seamlessly integrated into the state-of-the-art LLM serving system vLLM. It effectively predicts response-length-based task ordering, reducing latency with minimal overhead. Extensive experiments across multiple LLMs and real-world inference datasets show that PARS significantly improves performance, including for reasoning workloads. Furthermore, our cross-model evaluations demonstrate that the design generalizes well, enabling effective scheduling even when predictors are trained on different LLMs.
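
The head-of-line blocking effect described above can be illustrated with a toy calculation. This sketch assumes a single server that runs one job at a time, all jobs arriving at time zero, and known service times. Real LLM serving batches requests and must predict response lengths, so the job durations and the helper name are purely illustrative.

```python
from itertools import accumulate


def mean_completion_time(service_times):
    """Average completion time when jobs run one after another in the given order."""
    completions = list(accumulate(service_times))  # running sum = finish time of each job
    return sum(completions) / len(completions)


# Hypothetical workload: one long job queued ahead of three short ones.
jobs = [10, 1, 1, 1]

fcfs = mean_completion_time(jobs)         # arrival order: short jobs wait behind the long one
sjf = mean_completion_time(sorted(jobs))  # shortest-job-first order

print(f"FCFS mean completion time: {fcfs}")  # 11.5
print(f"SJF mean completion time:  {sjf}")   # 4.75
```

The long job finishes at the same time in both orders (13 time units), while the three short jobs finish far earlier under SJF. That is the gain a length-ranking scheduler aims to approximate.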

Paper Structure

This paper contains 12 sections, 2 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: PARS Workflow
  • Figure 2: Relative variance from ten inference runs of 30 prompts on Llama 3.1 and DeepSeek R1.