LLM Serving Optimization with Variable Prefill and Decode Lengths

Meixuan Wang, Yinyu Ye, Zijie Zhou

TL;DR

This work investigates offline scheduling for LLM serving under a fixed KV-cache memory budget when requests have heterogeneous prompt and decode lengths. It introduces Sorted-F, a batching algorithm built on a new batch-quality metric $F(\mathcal{X}) = \frac{\sum_{i \in \mathcal{X}} o_i}{|\mathcal{X}|^2}$, where $o_i$ is the decode length of request $i$ in batch $\mathcal{X}$, and proves that its competitive ratio is at most 48 relative to the optimum. Practical exact and heuristic variants accompany the algorithm. The authors also present LP-guided extensions (Sorted-LP and LP-Swap) and a robust adaptation that refines output-length estimates adaptively to handle prediction uncertainty, together with numerical experiments on workloads mixing short and long prompts. The results show that Sorted-F consistently reduces average latency compared with baselines, and that the LP-based and refinement strategies remain robust across diverse workloads and when output-length predictions are inaccurate. Overall, the paper provides a principled, tunable framework for production batch schedulers and capacity planning in memory-constrained LLM serving systems.
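To make the metric concrete, the following minimal sketch evaluates $F(\mathcal{X})$ for a candidate batch given only its decode lengths $o_i$. The function name `f_metric` and the example batches are hypothetical, and the reading that a smaller value marks a more attractive batch is an assumption not stated in the summary above.

```python
from typing import Sequence


def f_metric(decode_lengths: Sequence[int]) -> float:
    """Return F(X) = (sum of o_i over the batch) / |X|^2.

    `decode_lengths` holds the decode length o_i of each request in the
    candidate batch X. The name `f_metric` is illustrative only.
    """
    if not decode_lengths:
        raise ValueError("batch must be non-empty")
    return sum(decode_lengths) / len(decode_lengths) ** 2


# Hypothetical candidate batches (decode lengths only).
short_batch = [2, 2, 2, 2]   # many short responses
mixed_batch = [2, 2, 100]    # one long response dominates the sum

print(f_metric(short_batch))  # 8 / 16
print(f_metric(mixed_batch))  # 104 / 9
```

The squared batch size in the denominator rewards larger batches, while the numerator penalizes batches whose total decode work is large.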

Abstract

We study offline scheduling for large language model (LLM) serving under a fixed KV-cache memory budget, where requests have heterogeneous prompt (prefill) and response (decode) lengths. Prompt tokens determine initial KV usage, and each generated token increases memory by one unit. Given a backlog of n requests arriving together, we schedule mixed prefill and decode batches to minimize total end-to-end latency. We show that heterogeneity in prompt lengths makes the problem computationally intractable and that widely used heuristics such as first-come-first-served and shortest-first can be arbitrarily suboptimal. We propose Sorted-F, which repeatedly forms feasible batches using a new selection metric that balances batch size against downstream decode cost, and prove it achieves a constant-factor guarantee on total latency. We further develop practical variants -- an exact solver for small instances and fast heuristics for larger ones -- and evaluate them on a public workload spanning short conversations and long-document summarization, where they consistently reduce average latency relative to standard baselines. Our results highlight that during peak-hour tidal backlogs, greedy GPU packing or short-request prioritization can perform poorly when prompt lengths vary widely, and provide a principled, tunable framework for designing production batch schedulers and planning capacity in memory-constrained LLM serving systems.
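The memory model in the abstract can be sketched as a small feasibility check. This is an illustration under stated assumptions, not the paper's procedure: all requests in the batch start together, a request with prompt length `s` that has generated `t` tokens occupies `s + t` units, and its memory is released once all `o` tokens are produced. The names `peak_kv_usage` and `fits` are hypothetical.

```python
from typing import List, Tuple

Request = Tuple[int, int]  # (prompt length s, decode length o)


def peak_kv_usage(requests: List[Request]) -> int:
    """Peak KV-cache usage when all requests start at the same step.

    Assumption: a request occupies s + t units after generating t tokens
    (t = 0..o) and frees its memory afterwards.
    """
    if not requests:
        return 0
    horizon = max(o for _, o in requests)
    peak = 0
    for t in range(horizon + 1):
        # Only requests that have not yet finished still hold memory.
        usage = sum(s + t for s, o in requests if t <= o)
        peak = max(peak, usage)
    return peak


def fits(requests: List[Request], budget: int) -> bool:
    """True if the batch never exceeds the KV-cache budget."""
    return peak_kv_usage(requests) <= budget


batch = [(3, 2), (1, 4)]
print(peak_kv_usage(batch), fits(batch, 8), fits(batch, 7))
```

Because memory grows with every generated token, a batch that fits at admission time can still overflow later, which is why the check scans the whole decode horizon rather than only the initial prompt sizes.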

Paper Structure

This paper contains 38 sections, 13 theorems, 74 equations, 10 figures, 4 tables, 7 algorithms.

Key Result

Theorem 1

Suppose each request $i \in [n]$ satisfies $s_i \in [1,M]$, $o_i \in [1,M]$, and $s_i + o_i \leq M$. Then MC-SF has unbounded competitive ratio: $\mathrm{CR}(\textup{MC-SF}) \to \infty$ as $M \to \infty$.
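For reference, the competitive ratio of a scheduling algorithm is conventionally defined as its worst-case cost relative to the optimum. The notation below is an assumption for illustration and may differ from the paper's: $\mathcal{A}$ is the algorithm, $I$ ranges over problem instances, and $\mathrm{TEL}$ denotes total end-to-end latency.

$$
\mathrm{CR}(\mathcal{A}) \;=\; \sup_{I} \frac{\mathrm{TEL}(\mathcal{A}; I)}{\mathrm{TEL}(\textup{Optimal}; I)}
$$

Under this reading, Theorem 1 says that no constant bounds the ratio for MC-SF: as the memory budget $M$ grows, there are instances on which MC-SF's total latency exceeds the optimum by an arbitrarily large factor.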

Figures (10)

  • Figure 1: Illustration of LLM inference scheduling with prefill/decode mixing and continuous batching. Circles denote prefill operations; squares denote decode tokens; colors distinguish requests. Operations in the same column constitute a batch processed simultaneously. The schedule exhibits PD mixing (batches contain both prefill and decode operations) and continuous batching (requests enter dynamically as memory becomes available).
  • Figure 2: Schematic illustration of Definition 1, the separation transformation (from $\textup{Sorted-F}$ to $\textup{Sorted-F}_\mathrm{separate}$). The horizontal axis is time; each block represents a request whose width equals its response length.
  • Figure 3: Schematic illustration of the grouping transformation (from $\textup{Sorted-F}_\mathrm{separate}$ to $\textup{Sorted-F}_\mathrm{group}$).
  • Figure 4: Schematic illustration of the alignment transformation (from $\textup{Sorted-F}_\mathrm{group}$ to $\textup{Sorted-F}_\mathrm{align}$).
  • Figure 5: Schematic illustration of the alignment transformation applied to the optimal schedule (from $\textup{Optimal}$ to $\textup{Optimal}_\mathrm{align}$).
  • ...and 5 more figures

Theorems & Definitions (34)

  • Theorem 1
  • Theorem 2
  • Example 1
  • Example 2
  • Lemma 1
  • proof
  • Theorem 3
  • Remark 1: Intermediate schedules and feasibility
  • Definition 1: Separation transformation $\textup{Sorted-F}\mapsto \textup{Sorted-F}_\mathrm{separate}$
  • Theorem 4
  • ...and 24 more