Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints
Ruicheng Ao, Gan Luo, David Simchi-Levi, Xinshang Wang
TL;DR
This work addresses memory-constrained LLM inference where the KV cache grows during decode, causing eviction risk and potential instability under online batching. It introduces a fluid-dynamics benchmark for throughput under memory limits and develops the WAIT policy for known output lengths, plus Nested WAIT for unknown lengths, leveraging threshold-based admission to approach a load-balanced equilibrium. The authors provide asymptotic optimality guarantees, decoupling analyses, and high-probability memory bounds, supported by experiments on Llama-7B that show superior throughput and reduced latency against vLLM and Sarathi. The framework offers a principled, theory-grounded approach to deploying LLMs under strict memory constraints and motivates future work on multi-GPU deployment and integration with cutting-edge KV-cache techniques.
Abstract
Large Language Models (LLMs) power many modern applications, but their inference procedure poses unique scheduling challenges: the Key-Value (KV) cache grows dynamically during response generation, and memory overflow triggers eviction that can cascade into system-wide failures. Even when memory capacity exceeds the theoretical requirement, conventional scheduling algorithms fail because they do not account for this dynamic memory growth; a system that should be stable can become unstable under poor scheduling. This paper formulates LLM inference optimization as a multi-stage online scheduling problem. We develop a fluid dynamics approximation to establish a tractable benchmark and derive the Waiting for Accumulated Inference Threshold (WAIT) algorithm. WAIT uses threshold-based batching to prevent eviction by keeping the system near load balance, achieving near-optimal throughput when output lengths are known. For practical settings where output lengths are unknown at arrival, we introduce Nested WAIT. Rather than predicting output lengths, Nested WAIT classifies prompts on the fly: short prompts complete early and exit, while longer prompts naturally advance to later segments. A safety buffer provides high-probability protection against memory overflow with only logarithmic overhead. Theoretical analysis establishes near-optimal performance in the asymptotic regime. Experiments on Llama-7B with an A100 GPU demonstrate that our approach achieves superior throughput and reduced latency compared to vLLM and Sarathi. This work applies operations research principles to establish a theoretical framework for LLM deployment under memory constraints.
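To make the idea of threshold-based batching under a growing KV cache concrete, the following is a toy sketch rather than the paper's WAIT algorithm. It assumes a single request type, output lengths known at arrival, and a KV footprint of one token-slot per prompt token plus one per generated token. The names `Request` and `wait_step`, and the conservative peak-footprint admission check, are illustrative assumptions that do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int    # tokens in the prompt
    output_len: int    # tokens to generate (assumed known at arrival)
    decoded: int = 0   # tokens generated so far

    def kv_tokens(self) -> int:
        # Assumed cost model: one KV entry per prompt token and per generated token.
        return self.prompt_len + self.decoded

    def peak_tokens(self) -> int:
        # Footprint right before the request completes and frees its cache.
        return self.prompt_len + self.output_len


def wait_step(running, waiting, threshold, capacity):
    """One iteration: possibly admit a batch, then decode one token per request.

    Returns (completed_count, kv_tokens_in_use_after_decode).
    """
    # Admit only once `threshold` requests have accumulated in the queue.
    if len(waiting) >= threshold:
        batch = waiting[:threshold]
        # Conservative check (an assumption of this sketch): reserve the peak
        # footprint of every in-flight request, so eviction is never needed.
        reserved = sum(r.peak_tokens() for r in running + batch)
        if reserved <= capacity:
            running.extend(batch)
            del waiting[:threshold]

    # Decode one token for every running request; each KV cache grows by one.
    for r in running:
        r.decoded += 1
    in_use = sum(r.kv_tokens() for r in running)

    # Completed requests leave the system and release their memory.
    finished = [r for r in running if r.decoded >= r.output_len]
    running[:] = [r for r in running if r.decoded < r.output_len]
    return len(finished), in_use


running, waiting = [], [Request(prompt_len=4, output_len=3) for _ in range(5)]
for _ in range(5):
    print(wait_step(running, waiting, threshold=2, capacity=40))
```

In this toy run, batches of two are admitted in consecutive iterations, so requests at different decode stages share the memory budget, and the fifth request stays queued because the threshold is never reached again. Reserving each request's peak footprint is deliberately pessimistic; the abstract's load-balance argument concerns staggered requests whose total footprint stays near a steady level, which this sketch does not attempt to reproduce.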
