LLM Query Scheduling with Prefix Reuse and Latency Constraints
Gregory Dexter, Shao Tang, Ata Fatahi Baarzi, Qingquan Song, Tejas Dharamsi, Aman Gupta
TL;DR
This work studies online scheduling of RadixAttention-enabled LLM queries under strict TTFT constraints, addressing how prefix reuse interacts with arrival dynamics. It formalizes a non-preemptive, prefix-aware scheduling model, proves that deciding TTFT feasibility is NP-hard via a reduction from 3-PARTITION, and introduces the $k$-LPM algorithm, which generalizes FCFS and LPM to better balance prefix reuse against waiting time. Theoretical results show that $k$-LPM achieves improved TTFT under a data generative model of realistic traffic, and experiments on Llama-3.1-8B-Instruct show significant reductions in P99 TTFT. The work lays a foundation for prefix-aware, latency-conscious LLM serving and points to practical extensions such as percentile-based guarantees and distributional traffic models.
Abstract
The efficient deployment of large language models (LLMs) in online settings requires optimizing inference performance under stringent latency constraints, particularly the time-to-first-token (TTFT) and time-per-output-token (TPOT). This paper focuses on the query scheduling problem for LLM inference with prefix reuse, a technique that leverages shared prefixes across queries to reduce computational overhead. Our work reveals previously unknown limitations of the existing first-come-first-serve (FCFS) and longest-prefix-match (LPM) scheduling strategies with respect to satisfying latency constraints. We present a formal theoretical framework for LLM query scheduling under RadixAttention, a prefix reuse mechanism that stores and reuses intermediate representations in a radix tree structure. We establish the NP-hardness of the scheduling problem with prefix reuse under TTFT constraints and propose a novel scheduling algorithm, $k$-LPM, which generalizes existing methods by balancing prefix reuse and fairness in query processing. Theoretical guarantees demonstrate that $k$-LPM achieves improved TTFT performance under realistic traffic patterns captured by a data generative model. Empirical evaluations in a realistic serving setting validate our findings, showing significant reductions in P99 TTFT compared to baseline methods.
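To make the trade-off between prefix reuse and fairness concrete, the following is a minimal sketch of a scheduler that interpolates between FCFS and LPM. The interleaving rule (up to $k$ longest-prefix-match picks followed by one arrival-order pick) is an assumption made for illustration; the abstract does not specify the exact rule or indexing of $k$-LPM. The names `shared_prefix_len` and `hybrid_order` are hypothetical, and the cache is modeled as unbounded with a linear scan over processed queries rather than a radix tree.

```python
def shared_prefix_len(a, b):
    """Length of the common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def hybrid_order(queries, k):
    """Return a processing order (indices into `queries`, given in arrival order).

    Assumed rule (illustrative only): take up to k picks by longest cached
    prefix, then one pick by arrival order, and repeat. With k = 0 this is
    FCFS; with k >= len(queries) it behaves like pure LPM.
    """
    waiting = list(range(len(queries)))  # indices, oldest first
    order = []                           # indices already processed (the "cache")

    def cached_len(i):
        # Simplified cache model: every processed query stays cached forever.
        return max(
            (shared_prefix_len(queries[i], queries[j]) for j in order),
            default=0,
        )

    greedy_left = k
    while waiting:
        if greedy_left > 0:
            # LPM step: longest cached prefix, ties broken by earliest arrival.
            pick = max(waiting, key=lambda i: (cached_len(i), -i))
            greedy_left -= 1
        else:
            # FCFS step: serve the oldest waiting query, then reset the budget.
            pick = waiting[0]
            greedy_left = k
        waiting.remove(pick)
        order.append(pick)
    return order


if __name__ == "__main__":
    # Strings stand in for token sequences; characters play the role of tokens.
    qs = ["ab1", "cd1", "ef1", "ab2", "ab3"]
    for k in (0, 1, 10):
        print(k, hybrid_order(qs, k))
```

In this toy model, larger $k$ groups queries that share a prefix (more reuse) at the cost of delaying older queries with unrelated prefixes, while smaller $k$ bounds how long any query can be skipped.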
