No Request Left Behind: Tackling Heterogeneity in Long-Context LLM Inference with Medha
Amey Agrawal, Haoran Qiu, Junda Chen, Íñigo Goiri, Chaojie Zhang, Rayyan Shahid, Ramachandran Ramjee, Alexey Tumanov, Esha Choukse
TL;DR
Medha tackles convoy effects in long-context LLM inference by combining fine-grained preemption, adaptive chunking, and a novel 3D parallelism stack that includes Stream Pipeline Parallelism and KV-Cache Parallelism, all orchestrated by a Length-Aware Relative Slack (LARS) scheduler. It shows that chunked prefills remain efficient because of their high arithmetic intensity, countering the concern that chunking amplifies KV-cache reads, and enables scalable preemptive inference with dynamic batching that respects SLOs. Across real workloads and large-scale GPU clusters, Medha delivers up to 5.7× higher throughput and reduces median and tail latency by up to 30× and 174×, respectively, compared with non-preemptive baselines. The work offers a practical path to serving heterogeneous, multi-million-token contexts in production and shows that preemption is viable and beneficial for future large-context LLM serving.
Abstract
Deploying million-token Large Language Models (LLMs) is challenging because production workloads are highly heterogeneous, mixing short queries and long documents. This heterogeneity, combined with the quadratic complexity of attention, creates severe convoy effects in which long-running requests stall short, interactive ones and degrade system responsiveness. We present Medha, a serving system that eliminates these convoys by introducing fine-grained, preemptive scheduling to LLM inference. Medha makes preemption practical with a co-designed set of mechanisms, including Adaptive Chunking and Stream Pipeline Parallel, that overcome the perceived inefficiencies and scaling challenges of chunking. Additionally, we present a new parallelism strategy, KV-Cache Parallelism, which reduces decode latency and preserves interactivity despite very long contexts. These mechanisms are orchestrated by a Length-Aware Relative Slack (LARS) scheduler, a deadline- and heterogeneity-aware scheduling policy that prevents both the convoy effect and the starvation that plagues simpler policies. Under a heterogeneous workload, Medha improves throughput by 5.7× while reducing median and 99th-percentile latency by 30× and 174×, respectively, compared to state-of-the-art non-preemptive systems.
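
The convoy effect described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not Medha's scheduler: it compares run-to-completion FCFS against plain chunked round-robin (a stand-in for preemption at chunk granularity, not LARS). The job names, the abstract "work units", and the chunk size of 50 are all hypothetical, and every job is assumed to arrive at time 0 with the long job first.

```python
from collections import deque


def fcfs(jobs):
    """Non-preemptive baseline: run each job to completion in arrival order."""
    t, done = 0, {}
    for name, work in jobs:
        t += work
        done[name] = t  # completion time of this job
    return done


def chunked_round_robin(jobs, chunk):
    """Preemptive toy policy: run at most `chunk` units of a job, then yield."""
    queue = deque(jobs)
    t, done = 0, {}
    while queue:
        name, remaining = queue.popleft()
        step = min(chunk, remaining)
        t += step
        remaining -= step
        if remaining:
            queue.append((name, remaining))  # preempted: back of the queue
        else:
            done[name] = t
    return done


# Hypothetical workload: one long request ahead of two short ones.
jobs = [("long", 1000), ("short-1", 10), ("short-2", 10)]

print(fcfs(jobs))
print(chunked_round_robin(jobs, chunk=50))
```

In this toy setting the short jobs finish at 1010 and 1020 time units under FCFS but at 60 and 70 under chunked round-robin, while the long job finishes at 1020 instead of 1000. The sketch ignores any per-chunk overhead and does not model deadlines or starvation, which are the concerns the LARS policy is said to address.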
