LLM Query Scheduling with Prefix Reuse and Latency Constraints

Gregory Dexter, Shao Tang, Ata Fatahi Baarzi, Qingquan Song, Tejas Dharamsi, Aman Gupta

TL;DR

This work studies online scheduling of RadixAttention-enabled LLM queries under strict TTFT constraints, addressing how prefix reuse interacts with arrival dynamics. It formalizes a non-preemptive, prefix-aware scheduling model, proves TTFT feasibility is NP-Hard via a 3-PARTITION reduction, and introduces the $k$-LPM algorithm that generalizes FCFS and LPM to better balance reuse and waiting time. Theoretical results show $k$-LPM achieves improved TTFT under realistic data-generation assumptions, complemented by empirical validation on Llama-3.1-8B-Instruct demonstrating significant P99 TTFT reductions in practice. The work lays a foundation for prefix-aware, latency-conscious serving of LLMs and points to practical extensions like percentile-based guarantees and distributional traffic models.

Abstract

The efficient deployment of large language models (LLMs) in online settings requires optimizing inference performance under stringent latency constraints, particularly the time-to-first-token (TTFT) and time-per-output-token (TPOT). This paper focuses on the query scheduling problem for LLM inference with prefix reuse, a technique that leverages shared prefixes across queries to reduce computational overhead. Our work reveals previously unknown limitations of the existing first-come-first-serve (FCFS) and longest-prefix-match (LPM) scheduling strategies with respect to satisfying latency constraints. We present a formal theoretical framework for LLM query scheduling under RadixAttention, a prefix reuse mechanism that stores and reuses intermediate representations in a radix tree structure. We establish the NP-hardness of the scheduling problem with prefix reuse under TTFT constraints and propose a novel scheduling algorithm, $k$-LPM, which generalizes existing methods by balancing prefix reuse and fairness in query processing. Theoretical guarantees demonstrate that $k$-LPM achieves improved TTFT performance under realistic traffic patterns captured by a data generative model. Empirical evaluations in a realistic serving setting validate our findings, showing significant reductions in P99 TTFT compared to baseline methods.
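To make the trade-off between prefix reuse and fairness concrete, the following is a minimal sketch of one way a scheduler could interpolate between FCFS and LPM. It is not the paper's Algorithm 1. It assumes a hypothetical rule in which every $k$-th scheduling decision serves the oldest pending query and all other decisions serve the query with the longest cached prefix, chosen only because it reproduces the stated endpoints ($k=1$ gives FCFS, $k=\infty$ gives LPM). The function names, the flat list standing in for the radix-tree cache, and the assumption that all queries are already waiting are illustrative simplifications.

```python
import math


def common_prefix_len(a, b):
    """Length of the shared prefix of two sequences (strings or token lists)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cached_prefix_len(query, cache):
    """Longest prefix of `query` shared with any cached prompt.

    Assumption: a flat list stands in for the radix tree used by RadixAttention.
    """
    return max((common_prefix_len(query, c) for c in cache), default=0)


def k_lpm_order(queries, k):
    """Return a processing order for `queries`, given in arrival order.

    Hypothetical rule (an assumption, not taken from the paper): every k-th
    decision is FCFS (oldest pending query); every other decision is LPM
    (longest cached prefix, ties broken by arrival order).
    """
    pending = list(queries)
    cache = []
    order = []
    step = 0
    while pending:
        step += 1
        if k != math.inf and step % k == 0:
            idx = 0  # FCFS step: serve the oldest pending query
        else:
            # LPM step: maximize cached prefix length, prefer earlier arrivals
            idx = max(
                range(len(pending)),
                key=lambda i: (cached_prefix_len(pending[i], cache), -i),
            )
        query = pending.pop(idx)
        order.append(query)
        cache.append(query)  # the processed prompt becomes reusable
    return order


if __name__ == "__main__":
    arrivals = ["ab1", "cd1", "ab2", "ab3", "cd2"]
    for k in (1, 3, math.inf):
        print(k, k_lpm_order(arrivals, k))
```

In this toy setting, small $k$ keeps the order close to arrival order, which bounds how long any query waits, while large $k$ groups queries that share a prefix, which increases cache reuse at the risk of delaying queries with no cached prefix.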

Paper Structure

This paper contains 21 sections, 4 theorems, 23 equations, 3 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Deciding whether there exists a processing order for a query stream $\mathcal{Q}$ (Definition def:query_stream) such that a TTFT constraint $T$ is satisfied under the LLM instance computation model of Definition 2 is an NP-hard problem.

Figures (3)

  • Figure 1: This figure graphically represents the imposed structure for any feasible schedule in the query stream construction of Theorem 1. Note that the only flexibility in the schedule is how the set of strings $\{\mathbf{x}_i\}_{i\in[3m]}$ fits into the $m$ time windows of size $H$. The solid lines represent arrival times and the dashed lines represent processing start times.
  • Figure 2: We measure P99 TTFT versus request rate for five values of the hyperparameter $k$ on 2000 randomly shuffled prompts from the use case described in 360brew. Note that $k=1$ corresponds to FCFS and $k=\infty$ corresponds to LPM.
  • Figure 3: We measure P50, P90, P95, and P99 TTFT versus request rate for five values of the hyperparameter $k$ on 2000 randomly shuffled prompts from the use case described in 360brew. Note that $k=1$ corresponds to FCFS and $k=\infty$ corresponds to LPM.

Theorems & Definitions (8)

  • Definition 1
  • Definition 2: LLM Instance Computation
  • Definition 3
  • Theorem 1
  • Definition 4: Regular Arrival Shuffled Queue
  • Theorem 2: LPM/FCFS vs. $k$-LPM
  • Corollary 3
  • Theorem 4