MEPIC: Memory Efficient Position Independent Caching for LLM Serving

Qian Wang, Zahra Yousefijamarani, Morgan Lindsay Heisler, Rongzhi Gu, Bai Xiaolong, Shan Yizhou, Wei Zhang, Wang Lan, Ying Xiong, Yong Zhang, Zhenan Fan

TL;DR

MEPIC tackles the memory bottleneck of KV caches in long-context LLM serving by enabling cross-position, cross-request chunk KV reuse within a paged, HBM-resident KV store. It introduces a chunk cache coordinator, a segmentation-and-padding scheme for canonical, block-aligned KV layouts, selective block-level recomputation, and a position-independent NoPE KV format with RoPE fusion in the attention kernel. The approach integrates into the existing vLLM+LMCache stack and reduces HBM usage by up to 2x over state-of-the-art PIC, and by up to 5x for long prompts, with comparable or improved latency and accuracy across diverse datasets and workloads, especially under long prompts and high concurrency. Because it requires no model changes, it offers a practical route to more scalable long-context inference in multi-tenant LLM serving.

Abstract

Modern LLM applications such as deep-research assistants, coding agents, and Retrieval-Augmented Generation (RAG) systems repeatedly process long prompt histories containing shared document or code chunks, creating significant pressure on the Key-Value (KV) cache, which must operate within limited memory while sustaining high throughput and low latency. Prefix caching alleviates some of these costs by reusing the KV cache for previously processed tokens, but it is limited by strict prefix matching. Position-independent caching (PIC) enables chunk-level reuse at arbitrary positions, but requires selective recomputation and positional-encoding (PE) adjustments. Because these operations vary across queries, the KV for the same chunk diverges across requests. Moreover, without page alignment, chunk KV layouts diverge in memory, preventing page sharing. These issues result in only modest HBM savings even when many requests reuse the same content. We present MEPIC, a memory-efficient PIC system that enables chunk KV reuse across positions, requests, and batches. MEPIC aligns chunk KV to paged storage, shifts recomputation from token level to block level so that only the first block is request-specific, removes positional encodings via Rotary Position Embedding (RoPE) fusion in the attention kernel, and makes the remaining blocks fully shareable. These techniques eliminate most duplicate chunk KV in HBM, reducing usage by up to 2x over state-of-the-art PIC at comparable latency and accuracy, and by up to 5x for long prompts, without any model changes.
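To illustrate why storing keys without positional encoding can make chunk KV position-independent, the following is a minimal pure-Python sketch, not the authors' kernel. It assumes the adjacent-pair RoPE layout and a base of 10000; the function names `rope` and `attention_score_fused` are hypothetical. The same unrotated key is scored at two different absolute positions, with the rotation applied only at attention time.

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply rotary position embedding to an even-length vector at position `pos`.

    Assumption: adjacent-pair layout, where elements (0,1), (2,3), ... are rotated together.
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)   # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_score_fused(q, k_nope, q_pos, k_pos):
    """Score from a position-free (NoPE) key: both rotations happen at attention time."""
    return dot(rope(q, q_pos), rope(k_nope, k_pos))

q = [0.3, -1.2, 0.7, 0.5]
k_nope = [1.0, 0.4, -0.6, 0.9]   # cached once, with no position baked in

# The same cached key sits at position 5 in one request and 40 in another.
score_a = attention_score_fused(q, k_nope, q_pos=8, k_pos=5)
score_b = attention_score_fused(q, k_nope, q_pos=43, k_pos=40)
print(math.isclose(score_a, score_b, abs_tol=1e-9))
```

Because the score depends only on the relative offset between query and key, one cached copy of `k_nope` serves both requests, whereas a key stored after rotation (`rope(k_nope, 5)` versus `rope(k_nope, 40)`) would differ per request.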

Paper Structure

This paper contains 47 sections, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison of PIC Algorithms. The area above the dashed line corresponds to the compile step, while the area below corresponds to the link step. The naive algorithm does not recompute any tokens, whereas the Fully Recompute Algorithm recomputes all tokens (highlighted in darker colours). The four other PIC algorithms are KVLink, CacheBlend, EPIC, and MEPIC (our method). MEPIC enables cross-request HBM reuse, which reduces HBM usage and thus improves system throughput.
  • Figure 2: MEPIC system overview integrated into a vLLM/LMCache serving stack. The scheduling path constructs a chunk-aware KV placement plan within vLLM’s paged KV store, and the computation path follows this plan to recompute necessary tokens and execute attention with fused RoPE.
  • Figure 3: Scheduling components introduced by MEPIC for chunk-aware KV management. The Hybrid KV Manager coordinates prefix and chunk handling across shared HBM KV blocks, while specialized components enforce canonical chunk alignment, resolve cache residency, and manage allocation and eviction across local and remote tiers. Together, these components integrate chunk KV as a first-class object into vLLM’s scheduling path without changing its execution interface.
  • Figure 4: Segmentation and canonical block alignment in MEPIC. Padding enforces a canonical, block-aligned KV layout, allowing identical chunk segments to reuse the same KV blocks across requests.
  • Figure 5: An example of KV block allocation following segment residency classification. Based on per-segment residency, reusable KV blocks are shared, while non-resident segments are assigned newly allocated blocks from the shared HBM pool. For chunk segments, the first KV block is deterministically recomputed and allocated via the prefix cache, while the remaining blocks form canonical, shareable chunk KV managed by the chunk cache.
  • ...and 3 more figures
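
The block alignment and allocation described in the captions for Figures 4 and 5 can be sketched as a toy simulation. This is an illustration under stated assumptions, not MEPIC's scheduler: `BLOCK_SIZE`, `PAD`, `pad_to_block`, and `plan_request` are hypothetical names, the pool is a plain dictionary standing in for the shared HBM block pool, and prefix blocks are treated as request-private for simplicity (the real system routes them through the prefix cache).

```python
BLOCK_SIZE = 4   # tokens per KV block (hypothetical; kept small for readability)
PAD = -1         # hypothetical padding token id

def pad_to_block(tokens):
    """Right-pad a token list so its length is a multiple of BLOCK_SIZE."""
    remainder = len(tokens) % BLOCK_SIZE
    padding = [] if remainder == 0 else [PAD] * (BLOCK_SIZE - remainder)
    return list(tokens) + padding

def plan_request(request_id, segments, pool):
    """Map each segment of a request to physical block ids.

    segments: list of (kind, tokens) with kind "prefix" or "chunk".
    pool: dict from block key to physical block id (stand-in for shared HBM).
    """
    block_ids = []
    for seg_index, (kind, tokens) in enumerate(segments):
        padded = tuple(pad_to_block(tokens))
        for blk in range(len(padded) // BLOCK_SIZE):
            if kind == "chunk" and blk > 0:
                # Canonical, shareable block: keyed by chunk content only.
                key = ("chunk", padded, blk)
            else:
                # Request-specific block: prefix text or the first chunk block.
                key = ("private", request_id, seg_index, blk)
            if key not in pool:
                pool[key] = len(pool)   # allocate a new physical block
            block_ids.append(pool[key])
    return block_ids

pool = {}
chunk = list(range(100, 110))   # 10 tokens, padded to 12, so 3 blocks
blocks_a = plan_request("A", [("prefix", [1, 2, 3]), ("chunk", chunk)], pool)
blocks_b = plan_request("B", [("prefix", [4, 5, 6, 7, 8, 9]), ("chunk", chunk)], pool)
print(blocks_a, blocks_b, len(pool))
```

In this toy run the two requests reference 9 logical blocks but occupy only 7 physical ones, because the last two blocks of the shared chunk map to the same ids even though the chunk starts at a different offset in each request.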