Faster LLM Inference using DBMS-Inspired Preemption and Cache Replacement Policies
Kyoungmin Kim, Jiacheng Li, Kijae Hong, Anastasia Ailamaki
TL;DR
This work tackles LLM inference bottlenecks by bringing DBMS-inspired scheduling and cache management to GPU-resident KV caches. It introduces InferMax, a framework that combines a cost-model-based simulator, a real inference system, and a constraint satisfaction problem (CSP) formulation to explore optimal schedules under memory contention and dynamic workloads. It further proposes a cost-aware preemption policy and a Shortest-Request First cache replacement policy to maximize progress and minimize recomputation. The key findings are that short requests can benefit from preemption, yielding substantial latency and GPU-hour improvements, while larger requests are better served by avoiding preemption. The five-minute rule provides practical guidance for KV retention, and the CSP results offer performance upper bounds and insight into when preemption is advantageous. Collectively, the approach demonstrates that DBMS-informed strategies can yield meaningful GPU savings (potentially millions of dollars monthly in large deployments) and provide concrete, implementable guidelines for integrating LLMs into performance-critical data systems.
Abstract
LLMs are increasingly used worldwide, from daily tasks to agentic systems and data analytics, and they require significant GPU resources. LLM inference systems, however, are slow compared to database systems, and their performance and mechanisms have often been regarded as a black box, which limits the use of LLMs inside databases and other performance-critical applications. This paper first analyzes LLM inference performance and then focuses on a data management issue inside LLM inference. We find that inference systems lack an adequate resource cost model and optimization strategy for scheduling concurrent inference requests whose intermediate results are cached in GPU memory. We adapt classic database techniques by building cost models for concurrent inference requests and a new cache replacement policy tailored for LLM inference, which can substantially reduce GPU costs.
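
To make the idea of a cache replacement policy for concurrent requests concrete, the following is a minimal sketch of one possible reading of a shortest-request-first rule: when GPU memory runs short, the requests holding the fewest KV-cache tokens are preempted first, since they are the cheapest to recompute. This is an illustration under stated assumptions, not the paper's implementation; the `Request` type, the `kv_tokens` field, and the `pick_victims` function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record for a running inference request (not from the paper).
    req_id: str
    kv_tokens: int  # tokens this request currently holds in the KV cache


def pick_victims(running, tokens_needed):
    """Pick requests to preempt, shortest first, until enough tokens are freed.

    Assumption: evicting a request frees exactly its kv_tokens, and its
    recomputation cost grows with kv_tokens, so short requests go first.
    """
    victims, freed = [], 0
    for req in sorted(running, key=lambda r: r.kv_tokens):
        if freed >= tokens_needed:
            break
        victims.append(req)
        freed += req.kv_tokens
    return victims, freed


running = [Request("a", 500), Request("b", 40), Request("c", 120)]
victims, freed = pick_victims(running, tokens_needed=100)
print([v.req_id for v in victims], freed)
```

A real scheduler would also weigh the cost of recomputation against the cost of waiting, which is what a cost model for concurrent requests is meant to capture; this sketch only orders candidates by size.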
