KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse

Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang

TL;DR

This work tackles the inefficiency of re-encoding overlapping context in large language models by proposing KVLink, a method that precomputes a KV cache for each document and concatenates the caches at inference. It closes the cross-document attention gap with two techniques: positional re-encoding of the KV cache, and learnable link tokens that reconnect independently encoded segments. Optional cache compression reduces storage needs. Empirical results across multiple QA and summarization datasets show that KVLink improves accuracy relative to strong baselines and cuts time-to-first-token latency by up to 96%, while preserving general capabilities across model sizes. These findings position KVLink as a practical, scalable solution for context reuse in retrieval-augmented and multi-segment input scenarios.

Abstract

We describe KVLink, an approach for efficient key-value (KV) cache reuse in large language models (LLMs). In many LLM applications, different inputs can share overlapping context, such as the same retrieved document appearing in multiple queries. However, the LLMs still need to encode the entire context for each query, leading to redundant computation. In this paper, we investigate a new strategy to eliminate such inefficiency, where the KV cache of each document is precomputed independently. During inference, the KV caches of retrieved documents are concatenated, allowing the model to reuse cached representations instead of recomputing them. To mitigate the performance degradation when using KV caches computed independently for each document, KVLink introduces two key techniques: adjusting positional embeddings of the KV cache at inference to match the global position after concatenation, and using trainable special tokens to restore self-attention across independently encoded documents. Experiments across 7 datasets demonstrate that KVLink improves question answering accuracy by an average of 4% over state-of-the-art methods. Furthermore, by leveraging precomputed KV caches, our approach reduces time-to-first-token by up to 96% compared to standard LLM inference, making it a scalable and efficient solution for context reuse. Additionally, KVLink can be combined with KV cache compression to further save cache loading and storage overhead while outperforming the baselines.
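The positional adjustment described in the abstract can be illustrated with a minimal sketch. It assumes the model uses rotary position embeddings (RoPE) and that each document's keys were cached at local positions starting from 0; neither detail is stated in this section, and the function names `rope_rotate` and `reencode_cached_keys` are hypothetical. Because RoPE rotations compose additively, a key cached at local position p can be moved to global position p + offset by applying one extra rotation of `offset`. Values carry no positional rotation under RoPE, so only keys are touched.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate one key vector to `position` using RoPE on adjacent pairs.

    Assumption: pairs are (0, 1), (2, 3), ...; some implementations pair
    the first and second halves of the vector instead.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even head dimension")
    out = []
    for i in range(0, dim, 2):
        # Frequency for this pair: base^(-i / dim), where i is the even element index.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def reencode_cached_keys(cached_keys, offset):
    """Shift keys cached at local positions 0..n-1 to offset..offset+n-1.

    Each cached key already carries a rotation for its local position, so
    applying the same extra rotation `offset` to every key is enough.
    """
    return [rope_rotate(key, offset) for key in cached_keys]


# Hypothetical usage: a two-token document placed after 7 earlier tokens.
raw_keys = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.3, -0.1, 0.9]]
local_cache = [rope_rotate(k, p) for p, k in enumerate(raw_keys)]
global_cache = reencode_cached_keys(local_cache, offset=7)
```

An alternative design, equally consistent with the abstract, is to cache keys without any rotation and apply the full global rotation at inference; the sketch above only shows why a cache built at local positions remains reusable.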

Paper Structure

This paper contains 38 sections, 2 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Standard approach (top) encodes the KV cache of each document conditioned on preceding tokens, resulting in redundant and nonreusable KV cache encoding for shared documents (e.g., $\mathrm{Doc_{b}}$). In contrast, our setting (bottom) encodes documents separately, allowing KV cache reuse across queries.
  • Figure 2: Left: the attention map for all tokens. The link1 token attends only to the tokens in $\mathrm{Doc_{a}}$; link2 attends to the tokens in $\mathrm{Doc_{a}}$ and $\mathrm{Doc_{b}}$ and to link1; and link3 attends to the tokens in all reused contexts and the other link tokens. Right: the attention map for the three link tokens and the first user input token during inference. These two attention maps are identical.
  • Figure 3: Inference speed comparison with ten reused contexts of varying lengths. Both KVLink1 and KVLink5 show considerably lower Time-to-First-Token (TTFT) than standard decoding as context size grows.
  • Figure 4: Data Preprocess for Context Reuse.
  • Figure 5: System Prompts Used for Training. We employ tailored system prompts for three primary task types—SFT, QA, and Summarization—reflecting different objectives and guiding the model’s responses accordingly.
  • ...and 2 more figures
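
The attention pattern described in the Figure 2 caption can be sketched as a boolean mask. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the layout is taken to be document, link token, document, link token, and so on, followed by the user input; document tokens see only earlier tokens of their own document (they were encoded independently); link and user-input tokens see every earlier token. The function name `build_link_attention_mask` and the segment labels are hypothetical, and each token is also allowed to attend to itself, as in ordinary causal attention.

```python
def build_link_attention_mask(segments):
    """Build mask[i][j] == True when token i may attend to token j.

    `segments` is a list of (kind, length) pairs in sequence order, where
    kind is "doc", "link", or "query" (labels are assumptions of this sketch).
    """
    kinds, seg_starts = [], []
    pos = 0
    for kind, length in segments:
        for _ in range(length):
            kinds.append(kind)
            seg_starts.append(pos)  # index where this token's segment begins
        pos += length

    n = len(kinds)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: never look at later tokens
            if kinds[i] == "doc":
                # Independently encoded document: own segment only.
                mask[i][j] = j >= seg_starts[i]
            else:
                # Link and query tokens reconnect all preceding context.
                mask[i][j] = True
    return mask


# Hypothetical layout: Doc_a (2 tokens), link1, Doc_b (2 tokens), link2, query.
layout = [("doc", 2), ("link", 1), ("doc", 2), ("link", 1), ("query", 1)]
mask = build_link_attention_mask(layout)
```

In this layout, row 2 (link1) covers only Doc_a and itself, row 3 (the first token of Doc_b) covers only itself, and row 5 (link2) covers Doc_a, link1, Doc_b, and itself, matching the pattern the caption describes.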