No Request Left Behind: Tackling Heterogeneity in Long-Context LLM Inference with Medha

Amey Agrawal, Haoran Qiu, Junda Chen, Íñigo Goiri, Chaojie Zhang, Rayyan Shahid, Ramachandran Ramjee, Alexey Tumanov, Esha Choukse

TL;DR

Medha tackles convoy effects in long-context LLM inference by combining fine-grained preemption, adaptive chunking, and a novel 3D parallelism stack (Stream Pipeline Parallelism and KV-Cache Parallelism) with a Length-Aware Relative Slack (LARS) scheduler. It shows that chunked prefills can be efficient because of their high arithmetic intensity, countering concerns about KV-cache read amplification and enabling scalable preemptive inference with dynamic batching that respects SLOs. Across real workloads and large-scale GPU clusters, Medha delivers up to 5.7× higher throughput and reduces median and tail latency by up to 30× and 174×, respectively, compared with non-preemptive baselines. The work provides a practical path to deploying heterogeneous, multi-million-token contexts in production and shows that preemption is viable and beneficial for large-context LLM serving.

Abstract

Deploying million-token Large Language Models (LLMs) is challenging because production workloads are highly heterogeneous, mixing short queries and long documents. This heterogeneity, combined with the quadratic complexity of attention, creates severe convoy effects in which long-running requests stall short, interactive ones, degrading system responsiveness. We present Medha, a serving system that eliminates these convoys by introducing fine-grained, preemptive scheduling to LLM inference. Medha makes preemption practical with a co-designed set of mechanisms, including Adaptive Chunking and Stream Pipeline Parallelism, that overcome the perceived inefficiencies and scaling challenges of chunking. Additionally, we present a new parallelism strategy, KV-Cache Parallelism, to reduce decode latency and preserve interactivity despite very long contexts. These mechanisms are orchestrated by a Length-Aware Relative Slack (LARS) scheduler, a deadline- and heterogeneity-aware scheduling policy that prevents both the convoy effect and the starvation that plagues simpler policies. Under a heterogeneous workload, Medha improves throughput by 5.7x while reducing median and 99th percentile latency by 30x and 174x, respectively, compared to state-of-the-art non-preemptive systems.
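
The convoy effect described above can be illustrated with a toy simulation. This is a minimal sketch, not Medha's scheduler: it assumes all requests arrive at time 0, measures work in abstract time units, and uses plain round-robin over fixed-size chunks in place of LARS. The function names `fcfs_finish_times` and `chunked_finish_times` and the job sizes are hypothetical.

```python
from collections import deque


def fcfs_finish_times(jobs):
    """Non-preemptive FCFS: each job runs to completion in arrival order."""
    clock, finish = 0, {}
    for name, work in jobs:
        clock += work
        finish[name] = clock
    return finish


def chunked_finish_times(jobs, chunk):
    """Preemptive round-robin: run at most `chunk` units, then requeue the job."""
    queue = deque(jobs)
    clock, finish = 0, {}
    while queue:
        name, remaining = queue.popleft()
        step = min(chunk, remaining)
        clock += step
        remaining -= step
        if remaining:
            queue.append((name, remaining))  # preempted: go to the back of the queue
        else:
            finish[name] = clock
    return finish


# One long request (A) followed by two short ones (B, C); sizes are illustrative.
jobs = [("A", 100), ("B", 2), ("C", 2)]
print(fcfs_finish_times(jobs))              # {'A': 100, 'B': 102, 'C': 104}
print(chunked_finish_times(jobs, chunk=10)) # {'B': 12, 'C': 14, 'A': 104}
```

In this toy setting the short requests finish at 12 and 14 time units instead of 102 and 104, while the long request finishes only 4 units later and total work is unchanged.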

Paper Structure (34 sections, 1 equation, 20 figures, 1 table, 3 algorithms)

Figures (20)

  • Figure 1: Impact of long-context requests on TTFT for Llama-3 8B inference using 16 A100 GPUs with LoongServe [2024loongserve] and Medha at 0.75 QPS.
  • Figure 2: Impact of preemption on the convoy effect. Non-preemptive scheduling (top) blocks short requests B and C behind long request A, causing deadline violations. Preemptive scheduling (bottom) interleaves execution through chunking, eliminating the convoy effect while maintaining throughput.
  • Figure 3: Efficacy of chunked prefill for long-context inference.
  • Figure 4: Pareto frontiers of prefill/decode latencies in mixed batching with chunked prefills: (a) Static sizes have a trade-off between prefill and decode latencies. (b) Adaptive chunking starts with larger chunks, gradually reducing size to keep batch latencies consistent, achieving better prefill efficiency and low decode latency.
  • Figure 5: Microbatched pipeline parallelism interleaves micro-batches composed of prefills from different requests ($R1$, $R2$) to improve throughput. SPP, on the other hand, overlaps chunks of the same request ($R1_1$, $R1_2$) across stages to accelerate prefill processing. SPP scales better than CP due to lower communication overhead, resulting in up to 1.64$\times$ lower prefill latency for 1M-context processing for Llama-3 8B with H100s.
  • ...and 15 more figures
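
The adaptive chunking behavior described in the Figure 4 caption (start with larger chunks, then shrink them so batch latency stays consistent) can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the cost model `chunk_latency`, its coefficients, the halving rule, and the names `adaptive_chunks`, `budget`, `max_chunk`, and `min_chunk` are all hypothetical.

```python
def chunk_latency(chunk, prefix, linear_cost=1.0, attn_cost=0.001):
    """Toy cost model (assumed): linear layers scale with chunk size,
    attention scales with chunk size times the context seen so far."""
    return linear_cost * chunk + attn_cost * chunk * (prefix + chunk)


def adaptive_chunks(prompt_len, budget, max_chunk=4096, min_chunk=32):
    """Split a prompt into prefill chunks, halving the chunk size whenever
    the toy latency of the next chunk would exceed the per-iteration budget."""
    chunks, prefix, chunk = [], 0, max_chunk
    while prefix < prompt_len:
        # Attention cost grows with the prefix, so the chunk size only shrinks.
        while chunk > min_chunk and chunk_latency(chunk, prefix) > budget:
            chunk //= 2
        step = min(chunk, prompt_len - prefix)  # last chunk may be shorter
        chunks.append(step)
        prefix += step
    return chunks


chunks = adaptive_chunks(prompt_len=65536, budget=8192.0)
print(chunks[0], chunks[-1], len(chunks))
```

Under this toy model the schedule begins with large chunks and ends with small ones, which mirrors the qualitative trend in the caption; real chunk sizes would depend on measured latencies rather than a closed-form cost.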