EPIC: Efficient Position-Independent Caching for Serving Large Language Models

Junhao Hu; Wenrui Huang; Weidong Wang; Haoyi Wang; Tiancheng Hu; Qin Zhang; Hao Feng; Xusheng Chen; Yizhou Shan; Tao Xie

TL;DR

This work formalizes Position-Independent Caching (PIC) for LLM serving and introduces EPIC, a serving system whose LegoLink algorithm reduces link-time recomputation while preserving accuracy. LegoLink recomputes only a small subset of tokens (the initial tokens of each immutable chunk) and leverages static attention sparsity; with it, EPIC achieves up to 8x lower Time-To-First-Token (TTFT) and 7x throughput gains over existing systems. The two-step compile/link framework enables modular reuse of KV vectors across varying prefixes. Experiments across six datasets and three models show these latency and throughput improvements with negligible or no accuracy loss. The work supports practical long-context and retrieval-augmented generation deployments by providing explicit cache management and efficient linking.

Abstract

Large Language Models (LLMs) show great capabilities in a wide range of applications, but serving them efficiently becomes increasingly challenging as requests (prompts) become more complex. Context caching improves serving performance by reusing Key-Value (KV) vectors, the intermediate representations of tokens that are repeated across requests. However, existing context caching requires exact prefix matches across requests, limiting reuse cases in settings such as few-shot learning and retrieval-augmented generation, where immutable content (e.g., documents) remains unchanged across requests but is preceded by varying prefixes. Position-Independent Caching (PIC) addresses this issue by enabling modular reuse of the KV vectors regardless of prefixes. We formalize PIC and advance prior work by introducing EPIC, a serving system incorporating our new LegoLink algorithm, which mitigates the inappropriate "attention sink" effect at every document beginning, to maintain accuracy with minimal computation. Experiments show that EPIC achieves up to 8x improvements in Time-To-First-Token (TTFT) and 7x throughput gains over existing systems, with negligible or no accuracy loss.
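
The compile/link split and the selective recomputation described in the TL;DR can be illustrated with a toy sketch. This is not EPIC's implementation: `ChunkCache`, `link_recompute_positions`, and the parameter `k` are hypothetical names chosen for illustration, and the "KV vectors" are placeholder strings rather than real tensors. It shows only two ideas: cache entries are keyed by chunk content rather than by position, and at link time only the first few tokens of each chunk are marked for recomputation.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ChunkCache:
    """Toy position-independent cache: entries are keyed by chunk content only."""
    store: dict = field(default_factory=dict)
    compile_calls: int = 0  # counts how often a chunk actually had to be compiled

    def compile(self, chunk_tokens):
        """Compile step: build per-chunk entries once, independent of any prefix."""
        key = hashlib.sha256(repr(tuple(chunk_tokens)).encode()).hexdigest()
        if key not in self.store:
            self.compile_calls += 1
            # Placeholder standing in for real KV vectors (assumption for illustration).
            self.store[key] = [f"kv({tok})" for tok in chunk_tokens]
        return key


def link_recompute_positions(chunk_lengths, k):
    """Link step: global positions of the first k tokens of every chunk.

    Only these positions would be recomputed when cached chunks are
    concatenated; all other tokens reuse their cached entries.
    """
    positions = []
    offset = 0
    for length in chunk_lengths:
        positions.extend(range(offset, offset + min(k, length)))
        offset += length
    return positions


cache = ChunkCache()
doc_a = ["alpha", "beta", "gamma", "delta"]
doc_b = ["one", "two", "three"]
cache.compile(doc_a)
cache.compile(doc_b)
cache.compile(doc_a)  # same content again: served from the cache, no recompilation
print(cache.compile_calls)  # 2
print(link_recompute_positions([len(doc_a), len(doc_b)], k=2))  # [0, 1, 4, 5]
```

In this sketch the cost of linking grows with the number of chunks times `k`, not with the total number of tokens, which is the intuition behind recomputing only the beginning of each chunk.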
