Faster LLM Inference using DBMS-Inspired Preemption and Cache Replacement Policies

Kyoungmin Kim, Jiacheng Li, Kijae Hong, Anastasia Ailamaki

TL;DR

This work tackles the bottlenecks of LLM inference by bringing DBMS-inspired scheduling and cache management to GPU-based KV caches. It introduces InferMax, a framework that combines a cost-model-based simulator, a real inference system, and a constraint satisfaction problem (CSP) formulation to explore optimal schedules under memory contention and dynamic workloads. It further proposes a cost-aware preemption policy and a Shortest-Request First cache replacement policy to maximize progress and minimize recomputation. The key findings are that short requests can benefit from preemption, yielding substantial latency and GPU-hour improvements, while larger requests favor avoiding preemption. The five-minute rule provides practical guidance for KV retention, and the CSP results offer performance upper bounds and insight into when preemption is advantageous. Collectively, the approach demonstrates that DBMS-informed strategies can yield meaningful GPU savings (potentially millions of dollars monthly in large deployments) and provide concrete, implementable guidelines for integrating LLMs into performance-critical data systems.

Abstract

LLMs are increasingly used worldwide, from daily tasks to agentic systems and data analytics, requiring significant GPU resources. LLM inference systems, however, are slow compared to database systems, and inference performance and mechanisms have often been regarded as a black box, limiting the expansion of LLM use inside databases and other performance-critical applications. This paper first analyzes LLM inference performance and focuses on a data management issue inside LLM inference. We find that inference systems lack an adequate resource cost model and optimization strategy for scheduling requests, whose intermediate results reside in a cache in GPU memory, when executing multiple concurrent inference requests. We adapt classic database techniques by building cost models for concurrent inference requests and a new cache replacement policy tailored for LLM inference, which can substantially reduce GPU costs.

Paper Structure

This paper contains 23 sections, 10 equations, 22 figures, 4 tables, 1 algorithm.

Figures (22)

  • Figure 1: Overview of InferMax. Solid/dashed arrows indicate deployment/development phase. Orange boxes require actual GPUs while blue boxes do not. CSP denotes constraint satisfaction problem, and there is an omitted arrow from the cost model to CSP.
  • Figure 2: Cache management in LLM inference. Circled numbers show the three steps occurring batch-wise. $r$: request, $\mathcal{B}$: batch, EOS: end-of-sequence. If the KV cache size is 8, $r_3$ is preempted at $\mathcal{B}_2$ as the processed tokens of $r_1$, $r_2$, and $r_3$ would otherwise be 4+2+3 > 8 after processing $\mathcal{B}_2$.
  • Figure 3: A layer of the Transformer architecture. Gray boxes are operators (layernorms, activations, and other operators with negligible overheads are omitted). Blue boxes are model weights (matrices), where the arrow located on the left/right indicates input/output of matrix multiplication (matmul). Model weights are partitioned across two GPUs assuming a tensor parallelism degree of 2. All_Reduce adds intermediate results from all GPUs. Please refer to the paper's notation table for symbols.
  • Figure 4: GPU times measured for a layer of the Llama-2-7B model on one A100 (blue) and H100 (red). Non-attention adds all non-attention operators. Black lines are single-variable linear regressions with $R^2$ scores over 0.96.
  • Figure 5: Operator time breakdown for prefill and decode batches on H100. Matmul includes all '*_proj' in Figure 3 except QKV_proj. All $B$ requests in a batch have the same $c$ and $m$ values.
  • ...and 17 more figures
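
The capacity check described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the paper's scheduler: the function name `plan_batch`, the request ids, and the rule of evicting the most recently arrived request first are assumptions chosen only so that the sketch reproduces the caption's example (4+2+3 > 8, so $r_3$ is preempted).

```python
def plan_batch(projected_tokens, cache_size):
    """Split requests into (kept, preempted) so the kept ones fit in the KV cache.

    projected_tokens: dict mapping request id -> number of tokens the request
    would hold in the KV cache after the batch is processed. Insertion order
    is treated as arrival order.
    cache_size: KV cache capacity in tokens.
    """
    kept = dict(projected_tokens)
    preempted = []
    # Assumption (not from the paper): evict the most recently arrived
    # request first until the remaining requests fit in the cache.
    while kept and sum(kept.values()) > cache_size:
        victim = next(reversed(kept))  # last inserted key (Python 3.8+)
        preempted.append(victim)
        del kept[victim]
    return list(kept), preempted


# Figure 2 example: r1, r2, r3 would hold 4, 2, and 3 tokens after B_2.
kept, preempted = plan_batch({"r1": 4, "r2": 2, "r3": 3}, cache_size=8)
print(kept, preempted)
```

With a cache size of 8, the projected total of 9 tokens exceeds capacity, so the sketch preempts `r3` and keeps `r1` and `r2`. Choosing the victim by cost, as the paper's cost-aware preemption policy does, would replace the single line that picks `victim`.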