Marconi: Prefix Caching for the Era of Hybrid LLMs

Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, Ravi Netravali

TL;DR

Marconi addresses the cross-request inefficiency of prefix caching for Hybrid LLMs by introducing a radix-tree cache with judicious admission and FLOP-aware eviction tailored to the unique properties of SSM states and KV caches. It combines speculative prefix insertion to identify high-reuse opportunities with a compute-aware objective that favors long, high-FLOP prefixes, achieving substantial token hit-rate and latency improvements over Transformer-centric baselines. The results demonstrate strong gains across diverse workloads and model compositions, particularly as SSM dominance grows. The work provides practical mechanisms for deploying efficient long-context serving in Hybrid LLMs and is open-sourced for adoption and further research.

Abstract

Hybrid models that combine the language modeling capabilities of Attention layers with the efficiency of Recurrent layers (e.g., State Space Models) have gained traction in practically supporting long contexts in Large Language Model serving. Yet, the unique properties of these models complicate the usage of complementary efficiency optimizations such as prefix caching that skip redundant computations across requests. Most notably, their use of in-place state updates for recurrent layers precludes rolling back cache entries for partial sequence overlaps, and instead mandates only exact-match cache hits; the effect is a deluge of (large) cache entries per sequence, most of which yield minimal reuse opportunities. We present Marconi, the first system that supports efficient prefix caching with Hybrid LLMs. Key to Marconi are its novel admission and eviction policies that more judiciously assess potential cache entries based not only on recency, but also on (1) forecasts of their reuse likelihood across a taxonomy of different hit scenarios, and (2) the compute savings that hits deliver relative to memory footprints. Across diverse workloads and Hybrid models, Marconi achieves up to 34.4$\times$ higher token hit rates (71.1% or 617 ms lower TTFT) compared to state-of-the-art prefix caching systems.

Paper Structure

This paper contains 31 sections, 2 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Overview of Hybrid models.
  • Figure 2: Prefix caching reuses model states of common prefixes (green) across requests, accelerating inference. Fine-grained checkpointing results in many sparsely-hit entries (blue).
  • Figure 3: Fine-grained caching of token blocks results in many SSM states being cached. This creates sparsely-hit entries in which many SSM states are never reused (a), underutilizing the precious cache capacity. Worse, this creates a huge memory usage even for a single sequence of a 7B model (b), overwhelming and thrashing the cache.
  • Figure 4: Marconi performs a speculative insertion to check if inserting the prefill segment of a sequence results in an intermediate node. If so, the SSM states at the branch point are checkpointed. States at the last decoded token are checkpointed in any case. For ease of visualization, we associate model states with nodes rather than edges.
  • Figure 5: FLOP efficiency of the model states of different 7B models as the sequence length scales. The more SSM layers in the model, the steeper the increase in FLOP efficiency.
  • ...and 9 more figures
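
The speculative insertion described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Marconi's implementation: `Node`, `insert`, and `speculative_branch_point` are hypothetical names, tokens are plain integers, and no model states are stored. The sketch only shows how a read-only walk over a radix tree can tell whether inserting a sequence would create an intermediate node, and at which token offset the SSM state would then need to be checkpointed.

```python
from dataclasses import dataclass, field
from typing import Optional, Sequence


@dataclass
class Node:
    # Hypothetical radix-tree node: each child is reached through an edge
    # labelled with a run of tokens, keyed by the first token of that run.
    children: dict = field(default_factory=dict)  # int -> (tuple[int, ...], Node)


def _common_len(label: Sequence[int], tokens: Sequence[int], pos: int) -> int:
    """Length of the shared prefix of `label` and `tokens[pos:]`."""
    k = 0
    while k < len(label) and pos + k < len(tokens) and label[k] == tokens[pos + k]:
        k += 1
    return k


def speculative_branch_point(root: Node, tokens: Sequence[int]) -> Optional[int]:
    """Read-only walk: return the token offset at which inserting `tokens`
    would split an existing edge (creating an intermediate node), else None."""
    node, pos = root, 0
    while pos < len(tokens):
        entry = node.children.get(tokens[pos])
        if entry is None:
            return None  # a new leaf hangs off an existing node; no split
        label, child = entry
        k = _common_len(label, tokens, pos)
        if k < len(label):
            return pos + k  # the edge would be split after k matching tokens
        node, pos = child, pos + k
    return None  # the sequence ends exactly on an existing node


def insert(root: Node, tokens: Sequence[int]) -> None:
    """Actual insertion, splitting an edge where the sequences diverge."""
    tokens = tuple(tokens)
    node, pos = root, 0
    while pos < len(tokens):
        entry = node.children.get(tokens[pos])
        if entry is None:
            node.children[tokens[pos]] = (tokens[pos:], Node())
            return
        label, child = entry
        k = _common_len(label, tokens, pos)
        if k < len(label):
            mid = Node()  # the new intermediate node
            mid.children[label[k]] = (label[k:], child)
            node.children[tokens[pos]] = (label[:k], mid)
            child = mid
        node, pos = child, pos + k
```

In this sketch, a serving loop would call `speculative_branch_point` on the prefill tokens before running the model. A non-`None` result marks the offset whose SSM state is worth checkpointing during prefill, since the state cannot be rolled back to that position afterwards.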