EPIC: Efficient Position-Independent Caching for Serving Large Language Models

Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie

TL;DR

This work formalizes Position-Independent Caching (PIC) for LLM serving and introduces EPIC, a serving system whose LegoLink algorithm reduces link-time recomputation while preserving accuracy. LegoLink recomputes only a small subset of tokens (the initial tokens of each immutable chunk) and leverages static attention sparsity. With it, EPIC achieves up to 8x lower Time-To-First-Token (TTFT) and up to 7x higher throughput than existing systems. The two-step compile/link framework enables modular reuse of KV vectors across varying prefixes. Results across six datasets and three models show these latency and throughput gains with negligible or no accuracy loss. The work supports practical long-context and retrieval-augmented generation deployments through explicit cache management and efficient linking.
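As a rough illustration of "recomputing only the initial tokens of each immutable chunk", the sketch below picks which global token positions would be recomputed at link time. The function name `link_recompute_positions` and the parameter `k` (how many leading tokens per chunk are recomputed) are hypothetical and not taken from the paper; whether the first chunk is treated differently is also left open here.

```python
from typing import List


def link_recompute_positions(chunk_lengths: List[int], k: int = 2) -> List[int]:
    """Return global positions of the first k tokens of every chunk.

    chunk_lengths: token count of each cached chunk, in request order.
    k: assumed number of leading tokens recomputed per chunk (hypothetical).
    """
    positions: List[int] = []
    offset = 0  # global position of the current chunk's first token
    for length in chunk_lengths:
        # A chunk shorter than k contributes all of its tokens.
        positions.extend(range(offset, offset + min(k, length)))
        offset += length
    return positions


chunk_lengths = [5, 3, 4]
recompute = link_recompute_positions(chunk_lengths, k=2)
print(recompute)                            # [0, 1, 5, 6, 8, 9]
print(len(recompute), "of", sum(chunk_lengths), "tokens recomputed")
```

In this toy setup the recomputed share depends only on `k` and the number of chunks, not on chunk length, so it shrinks as chunks grow longer.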

Abstract

Large Language Models (LLMs) show great capabilities in a wide range of applications, but serving them efficiently becomes increasingly challenging as requests (prompts) become more complex. Context caching improves serving performance by reusing Key-Value (KV) vectors, the intermediate representations of tokens that are repeated across requests. However, existing context caching requires exact prefix matches across requests, limiting reuse cases in settings such as few-shot learning and retrieval-augmented generation, where immutable content (e.g., documents) remains unchanged across requests but is preceded by varying prefixes. Position-Independent Caching (PIC) addresses this issue by enabling modular reuse of the KV vectors regardless of prefixes. We formalize PIC and advance prior work by introducing EPIC, a serving system incorporating our new LegoLink algorithm, which mitigates the inappropriate "attention sink" effect at every document beginning, to maintain accuracy with minimal computation. Experiments show that EPIC achieves up to 8x improvements in Time-To-First-Token (TTFT) and 7x throughput gains over existing systems, with negligible or no accuracy loss.
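The difference between exact-prefix matching and position-independent reuse can be sketched with a toy cache lookup. This is only an illustration of the lookup keys, not EPIC's implementation: the helper names, the SHA-256 hashing, and the chunk granularity are all assumptions, and no KV vectors are computed.

```python
import hashlib
from typing import List, Set


def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def prefix_keys(chunks: List[str]) -> List[str]:
    """Key i covers everything up to and including chunk i."""
    return [digest("\x00".join(chunks[: i + 1])) for i in range(len(chunks))]


def chunk_keys(chunks: List[str]) -> List[str]:
    """Key i depends only on the content of chunk i."""
    return [digest(c) for c in chunks]


def prefix_hits(chunks: List[str], cache: Set[str]) -> int:
    """Prefix caching: reuse stops at the first chunk whose prefix differs."""
    hits = 0
    for key in prefix_keys(chunks):
        if key not in cache:
            break
        hits += 1
    return hits


def pic_hits(chunks: List[str], cache: Set[str]) -> int:
    """Position-independent caching: any previously seen chunk is reusable."""
    return sum(1 for key in chunk_keys(chunks) if key in cache)


req_a = ["question A", "document 1", "document 2"]
req_b = ["question B", "document 1", "document 2"]  # same documents, new prefix

prefix_cache = set(prefix_keys(req_a))
pic_cache = set(chunk_keys(req_a))

print(prefix_hits(req_b, prefix_cache))  # 0
print(pic_hits(req_b, pic_cache))        # 2
```

The two documents are identical across requests, yet the prefix-keyed cache reuses nothing because the first chunk changed. The content-keyed cache finds both documents, which is the reuse opportunity PIC targets.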

Paper Structure

This paper contains 20 sections, 1 equation, 10 figures.

Figures (10)

  • Figure 1: Left: Design space of position-independent context caching. Right: The x-axis shows the computation overhead or TTFT, while the y-axis shows accuracy. Different shades of the same color indicate variants of the same algorithm.
  • Figure 2: An analogy between position-independent code and position-independent cache.
  • Figure 3: The architecture of the EPIC serving system.
  • Figure 4: Comparison of PIC Algorithms. The four algorithms compared are Naive, Fully Recompute (FR), CacheBlend, and LegoLink. The area above the dashed line corresponds to the compile step, while the area below corresponds to the link step. KVLink recomputes a subset of tokens, highlighted in dark colors. The bottom right visualizes attention maps (layer 5, head 5 of Llama 3.1 8B) for four decoded tokens. The x-axis marks the position ID of the first token of each chunk. To highlight the differences between attention maps, we normalize the $QK^T$ results to the [0, 1] range using min-max scaling instead of Softmax.
  • Figure 5: Prefill and decode length distribution.
  • ...and 5 more figures
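
The Figure 4 caption mentions normalizing the $QK^T$ results with min-max scaling instead of Softmax. The following minimal sketch contrasts the two on one row of made-up scores; the values and helper names are illustrative assumptions and are not taken from the paper's data.

```python
import math
from typing import List


def min_max_scale(scores: List[float]) -> List[float]:
    """Map scores linearly so the minimum becomes 0 and the maximum becomes 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # constant row: no contrast to show
    return [(s - lo) / (hi - lo) for s in scores]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable Softmax; the outputs sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# One hypothetical row of raw attention scores; the first entry is much larger.
scores = [8.0, 1.0, 2.0, 3.0]
print(min_max_scale(scores))
print(softmax(scores))
```

Softmax is exponential, so a single large score pushes every other entry close to zero. Min-max scaling is linear and keeps the relative gaps among the smaller scores visible in a plot.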