Using Span Queries to Optimize for Cache and Attention Locality

Paul Castro, Nick Mitchell, Nathan Ordonez, Thomas Parnell, Mudhakar Srivatsa, Antoni Viros i Martin

TL;DR

This work introduces span queries as a general, declarative intermediate representation for optimizing cross-request inference across chat, RAG, inference-time scaling (ITS), and agentic workloads. By encoding commutativity constraints and desugaring span queries into core operators, the authors enable automatic optimizations that improve KV cache locality and attention locality, achieving 10–20x TTFT reductions and allowing a smaller model to outperform a larger baseline. A compact set of vLLM changes (CIDRA repositioning and related scheduler/GPU-runner updates) underpins these gains, which are demonstrated through RAG and Nested Generation microbenchmarks and bulk-span execution experiments. The approach also provides a framework for future optimizations, including gathering/scattering computation patterns, for multi-workload LLM serving.

Abstract

Clients are evolving beyond chat completion, and now include a variety of innovative inference-time scaling and deep reasoning techniques. At the same time, inference servers remain heavily optimized for chat completion. Prior work has shown that large improvements to KV cache hit rate are possible if inference servers evolve towards these non-chat use cases. However, that work offers solutions that are themselves optimized for a single use case, RAG. In this paper, we introduce the span query to generalize the interface to the inference server. We demonstrate that chat, RAG, inference-time scaling, and agentic workloads can all be expressed as span queries. We show that the critical distinction assumed by prior work lies in whether the order of the inputs matters: do they commute? In chat, they do not. In RAG, they often do. Span queries are expression trees of inference calls, linked together with commutativity constraints. We describe span query syntax and semantics. We show how span queries can be automatically optimized to improve KV cache locality. We show how a small change to vLLM (affecting only 492 lines) can enable high-performance execution of span queries. Using this stack, we demonstrate that span queries can achieve 10-20x reductions in TTFT for two distinct non-chat use cases. Finally, we show that span queries can also be optimized to improve attention locality, so as to avoid the so-called lost-in-the-middle problem. We demonstrate that an attention-optimized span query on a 2B-parameter model vastly outperforms the accuracy of a stock inference server using an 8B-parameter model.

Paper Structure

This paper contains 25 sections, 17 figures, 4 tables.

Figures (17)

  • Figure 1: The "dual output paradox": the model server emits one thing to the client and something different to KV cache.
  • Figure 2: Chat completion use case. Each rectangle is a token, a cache block fits 2 tokens, and a token's sequence position is shown in upper left corner. 80% hit rate on Request 2 (4 of 5 input tokens are cached), which asymptotes to 100% as the chat progresses.
  • Figure 3: RAG use case. 33% hit rate on Request 2 (2 of 6 input tokens cached), which asymptotes to 0% as $F_1,F_2$ grow.
  • Figure 4: An example of nested generation: the judge-generator inference-time scaling strategy.
  • Figure 5: Nested generation use case, following on from Figure 3. 29% hit rate on Request 3 (2 of 7 input tokens cached), which asymptotes to 0% as assistant output grows.
  • ...and 12 more figures
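
The hit rates quoted in the captions for Figures 2 and 3 follow from block-granular prefix matching: a block is reusable only if every token before it also matches. The sketch below is a toy model of that arithmetic, not vLLM's implementation. The class `PrefixCache`, the helper names, and the token labels (`u1`, `s1`, `f1a`, ...) are hypothetical, chosen only so that the counts match the captions (2 tokens per block, 4 of 5 cached for chat, 2 of 6 cached for RAG).

```python
from typing import List, Sequence, Set, Tuple

BLOCK_SIZE = 2  # the figure captions assume a cache block fits 2 tokens


def block_keys(tokens: Sequence[str], block_size: int = BLOCK_SIZE) -> List[Tuple[str, ...]]:
    """One key per full block; each key is the whole prefix ending at that block."""
    return [tuple(tokens[:end]) for end in range(block_size, len(tokens) + 1, block_size)]


class PrefixCache:
    """Toy prefix cache (hypothetical): a block hits only if its entire prefix matches."""

    def __init__(self, block_size: int = BLOCK_SIZE) -> None:
        self.block_size = block_size
        self.blocks: Set[Tuple[str, ...]] = set()

    def insert(self, tokens: Sequence[str]) -> None:
        # Only full blocks are stored; a trailing partial block is dropped.
        self.blocks.update(block_keys(tokens, self.block_size))

    def cached_tokens(self, tokens: Sequence[str]) -> int:
        hits = 0
        for key in block_keys(tokens, self.block_size):
            if key not in self.blocks:
                break  # first miss ends the reusable prefix
            hits += self.block_size
        return hits


def hit_rate(cache: PrefixCache, tokens: Sequence[str]) -> float:
    return cache.cached_tokens(tokens) / len(tokens)


# Chat: Request 2 extends Request 1 (input plus output), so the prefix is reused.
chat = PrefixCache()
chat.insert(["u1", "u2", "a1", "a2"])
chat_rate = hit_rate(chat, ["u1", "u2", "a1", "a2", "u3"])

# RAG: the shared leading tokens hit, but reordered fragments break the prefix.
rag = PrefixCache()
rag.insert(["s1", "s2", "f1a", "f1b", "q1", "q2"])
rag_rate = hit_rate(rag, ["s1", "s2", "f2a", "f2b", "f1a", "f1b"])

print(f"chat: {chat_rate:.0%}, rag: {rag_rate:.0%}")  # chat: 80%, rag: 33%
```

In the RAG case the tokens of fragment `f1` are already in the cache, but they sit behind a different prefix, so a prefix-only cache cannot reuse them. This is the loss of locality the captions describe.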

Theorems & Definitions (4)

  • Definition 1: Nested Generation
  • Definition 2: Span Query
  • Definition 3
  • Definition 4: Span
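
The abstract describes span queries as expression trees of inference calls linked by commutativity constraints. As a rough illustration of that idea only, the sketch below models a tree with two kinds of interior node: one whose children are order-sensitive (as in chat) and one whose children commute (as is often the case for RAG fragments). The names `Text`, `Ordered`, `Commuting`, `canonical`, and `equivalent` are hypothetical and are not the paper's syntax. Leaves are plain strings here rather than inference calls.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Text:
    """Leaf node (hypothetical): a fixed piece of input."""
    content: str


@dataclass(frozen=True)
class Ordered:
    """Interior node whose children do not commute (e.g. chat turns)."""
    children: Tuple["Node", ...]


@dataclass(frozen=True)
class Commuting:
    """Interior node whose children may be reordered (e.g. RAG fragments)."""
    children: Tuple["Node", ...]


Node = Union[Text, Ordered, Commuting]


def canonical(node: Node) -> Node:
    """Rewrite a tree so that commuting children appear in one fixed order."""
    if isinstance(node, Text):
        return node
    kids = tuple(canonical(child) for child in node.children)
    if isinstance(node, Commuting):
        # Any deterministic order works; sorting by repr is an arbitrary choice.
        return Commuting(tuple(sorted(kids, key=repr)))
    return Ordered(kids)


def equivalent(a: Node, b: Node) -> bool:
    """True if the two trees differ only in the order of commuting children."""
    return canonical(a) == canonical(b)


# Two RAG-style queries that list the same fragments in a different order.
rag_a = Ordered((Text("system"), Commuting((Text("F1"), Text("F2"))), Text("question")))
rag_b = Ordered((Text("system"), Commuting((Text("F2"), Text("F1"))), Text("question")))

# Two chat-style sequences with the turns swapped.
chat_a = Ordered((Text("turn 1"), Text("turn 2")))
chat_b = Ordered((Text("turn 2"), Text("turn 1")))

print(equivalent(rag_a, rag_b))    # True
print(equivalent(chat_a, chat_b))  # False
```

Marking commutativity explicitly gives a server the freedom to treat `rag_a` and `rag_b` as the same request for caching purposes, while still preserving order where it matters.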