Online Scheduling for LLM Inference with KV Cache Constraints

Patrick Jaillet, Jiashuo Jiang, Konstantina Mellou, Marco Molinaro, Chara Podimata, Zijie Zhou

TL;DR

This work addresses online batching and scheduling for LLM inference under KV-cache memory constraints, a setting where memory grows with token generation and decisions must be made within milliseconds. It introduces a formal online model and a hindsight-optimal benchmark formulated as an integer program (IP), proves that no deterministic online algorithm can achieve a constant competitive ratio in general, and proposes MC-SF, a polynomial-time online algorithm that uses predicted output lengths to ensure memory feasibility and minimize latency. Theoretical analysis provides upper and lower bounds on MC-SF and OPT, while empirical results on synthetic data and real LLM traces show MC-SF achieving near-optimal latency and outperforming standard baselines, with memory safety maintained across regimes. The findings support more sustainable and cost-effective LLM deployment by enabling efficient, memory-aware batching policies under KV-cache constraints.

Abstract

Large Language Model (LLM) inference, where a trained model generates text one word at a time in response to user prompts, is a computationally intensive process requiring efficient scheduling to optimize latency and resource utilization. A key challenge in LLM inference is the management of the Key-Value (KV) cache, which reduces redundant computations but introduces memory constraints. In this work, we model LLM inference with KV cache constraints theoretically and propose a novel batching and scheduling algorithm that minimizes inference latency while effectively managing the KV cache's memory. More specifically, we make the following contributions. First, to evaluate the performance of online algorithms for scheduling in LLM inference, we introduce a hindsight optimal benchmark, formulated as an integer program that computes the minimum total inference latency under full future information. Second, we prove that no deterministic online algorithm can achieve a constant competitive ratio when the arrival process is arbitrary. Third, motivated by the computational intractability of solving the integer program at scale, we propose a polynomial-time online scheduling algorithm and show that under certain conditions it can achieve a constant competitive ratio. We also demonstrate our algorithm's strong empirical performance by comparing it to the hindsight optimal in a synthetic dataset. Finally, we conduct empirical evaluations on a real-world public LLM inference dataset, simulating the Llama2-70B model on A100 GPUs, and show that our algorithm significantly outperforms the benchmark algorithms. Overall, our results offer a path toward more sustainable and cost-effective LLM deployment.
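
To make the KV-cache constraint concrete, the following is a minimal sketch of a memory-aware, shortest-predicted-output-first admission step. It is not the authors' MC-SF algorithm, whose details are not given on this page. The memory accounting (a request holds its prompt length plus one unit per generated token until it finishes), the `Request` fields, and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Request:
    name: str                    # hypothetical identifier
    prompt_len: int              # KV entries held for the prompt (assumed model)
    pred_out_len: int            # predicted number of output tokens
    start: Optional[int] = None  # round of the first generated token; None if waiting


def kv_usage(req: Request, t: int) -> int:
    """Assumed KV footprint of `req` during round t (0 if not running)."""
    if req.start is None:
        return 0
    generated = t - req.start + 1  # tokens produced by the end of round t
    if 1 <= generated <= req.pred_out_len:
        return req.prompt_len + generated
    return 0  # not started yet, or already finished and released


def is_feasible(active: List[Request], now: int, capacity: int) -> bool:
    """Check total predicted usage against capacity in every future round."""
    horizon = max((r.start + r.pred_out_len for r in active), default=now)
    return all(
        sum(kv_usage(r, t) for r in active) <= capacity
        for t in range(now, horizon)
    )


def admit_shortest_first(
    running: List[Request], waiting: List[Request], now: int, capacity: int
) -> List[Request]:
    """Greedily admit waiting requests in order of predicted output length,
    stopping at the first one that would overflow memory in some round."""
    batch = list(running)
    admitted = []
    for req in sorted(waiting, key=lambda r: r.pred_out_len):
        trial = replace(req, start=now)
        if not is_feasible(batch + [trial], now, capacity):
            break
        batch.append(trial)
        admitted.append(trial)
    return admitted


running = [Request("r", prompt_len=2, pred_out_len=3, start=0)]
waiting = [Request("B", 3, 5), Request("A", 2, 2)]
print([r.name for r in admit_shortest_first(running, waiting, now=1, capacity=10)])
```

The feasibility check scans every round up to the last predicted completion, which keeps the sketch easy to verify. Because usage only grows between completions under this accounting, checking the final round of each request would suffice.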

Paper Structure

This paper contains 25 sections, 9 theorems, 40 equations, 13 figures, 1 table, 2 algorithms.

Key Result

Theorem 4.1

Every deterministic algorithm has a competitive ratio at least $\Omega(\sqrt{n})$.
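
For reference, the standard notion of competitive ratio from online algorithms can be written as below. The symbols $\mathrm{ALG}(\mathcal{I})$ and $\mathrm{OPT}(\mathcal{I})$ (total latency of the online algorithm and of the hindsight optimum on instance $\mathcal{I}$) are notation assumed here, and reading $n$ as the number of requests is also an assumption, since this page does not define it.

$$\mathrm{CR}(\mathrm{ALG}) \;=\; \sup_{\mathcal{I}} \frac{\mathrm{ALG}(\mathcal{I})}{\mathrm{OPT}(\mathcal{I})}$$

Under this reading, the theorem says there is a constant $c > 0$ such that, for every deterministic algorithm and all sufficiently large $n$, some instance $\mathcal{I}_n$ satisfies $\mathrm{ALG}(\mathcal{I}_n) \ge c\sqrt{n}\,\mathrm{OPT}(\mathcal{I}_n)$, so the ratio cannot be bounded by a constant.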

Figures (13)

  • Figure 1: Example of online batching and scheduling.
  • Figure 2: Histogram of latency ratio: MC-SF vs Hindsight Optimal. Left: Arrival Model 1. Right: Arrival Model 2.
  • Figure 3: Average End-to-End Latency Across Scheduling Algorithms. Left: High Demand. Right: Low Demand.
  • Figure 4: Per-second Throughput Across Scheduling Algorithms.
  • Figure 5: Average End-to-End Latency Across Scheduling Algorithms Under Prediction Error.
  • ...and 8 more figures

Theorems & Definitions (26)

  • Theorem 4.1
  • Proposition 4.2
  • Theorem 4.3
  • proof : Proof of Theorem \ref{thm:main}
  • Lemma 4.4: UB on $\texttt{MC-SF}$
  • Lemma 4.5
  • proof
  • Claim 4.6: Peak to volume
  • proof : Proof of Claim \ref{claim:vol-around}
  • proof : Proof of Lemma \ref{lemma:UBAlg}
  • ...and 16 more