EpiCache: Episodic KV Cache Management for Long Conversational Question Answering

Minsoo Kim, Arnav Kundu, Han-Byul Kim, Richa Dixit, Minsik Cho

TL;DR

EpiCache tackles the memory bottleneck of KV caches in long conversational QA by introducing a training-free, memory-bounded framework that combines block-wise prefill with episodic KV compression. It clusters dialogue into episodes, builds episode-specific caches, and uses adaptive layer-wise budget allocation driven by Key-state sensitivity to preserve long-range context under fixed budgets. Across Realtalk, LoCoMo, and LongMemEval, EpiCache yields up to 40% accuracy improvements over strong baselines at the same budget and approaches full KV performance at 4–6× compression, while delivering up to 3.5× memory savings and 2.4× decoding speedups. This approach enables efficient, scalable long-context interactions in resource-constrained deployments without retraining.

Abstract

Modern large language models (LLMs) extend context lengths to millions of tokens, enabling coherent, personalized responses grounded in long conversational histories. This ability, however, hinges on Key-Value (KV) caching, whose memory grows linearly with dialogue length and quickly becomes the bottleneck in resource-constrained environments. An active line of research for reducing this memory bottleneck is KV cache compression, which seeks to limit cache size while preserving accuracy. Yet existing methods face two major limitations: (i) evicting the KV cache only after full-context prefill leaves peak memory unbounded, and (ii) query-dependent eviction narrows the cache to a single query, leading to failure cases in multi-turn conversations. We introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and applies episode-specific KV cache eviction. We further design an adaptive layer-wise budget allocation strategy that measures each layer's sensitivity to eviction and distributes the memory budget across layers accordingly. Across three LongConvQA benchmarks, EpiCache improves accuracy by up to 40%, maintains near-full KV accuracy under 4–6× compression, and reduces latency/memory by up to 2.4×/3.5×, enabling efficient multi-turn interaction under strict resource limits. Our code is available at https://github.com/apple/ml-epicache.
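
To make the memory argument concrete, the block-wise prefill idea can be sketched as a toy loop. This is an illustrative sketch only, not the authors' implementation: the function `block_prefill`, the `(position, token)` stand-in for a KV entry, and the `toy_score` importance function are all hypothetical, and the real scoring rule is not specified in this summary. The point it shows is that evicting after every block keeps the peak cache size at roughly budget plus block size, independent of input length.

```python
from typing import Callable, List, Sequence, Tuple


def block_prefill(
    tokens: Sequence[str],
    block_size: int,
    budget: int,
    score: Callable[[str], float],
) -> Tuple[List[str], int]:
    """Process tokens block by block, evicting down to `budget` after each block.

    Returns the retained tokens (in original order) and the peak cache size.
    Assumption: `score` is a placeholder importance function, not EpiCache's.
    """
    cache: List[Tuple[int, str]] = []  # (position, token) stands in for a KV entry
    peak = 0
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        cache.extend((start + i, tok) for i, tok in enumerate(block))
        peak = max(peak, len(cache))
        if len(cache) > budget:
            # Keep the highest-scoring entries, then restore positional order.
            kept = sorted(cache, key=lambda e: score(e[1]), reverse=True)[:budget]
            cache = sorted(kept)
    return [tok for _, tok in cache], peak


tokens = [f"t{i}" for i in range(12)]
toy_score = lambda tok: float(int(tok[1:]) % 5)  # hypothetical importance score
kept, peak = block_prefill(tokens, block_size=3, budget=4, score=toy_score)
print(kept, peak)
```

With post-prefill eviction the peak would instead equal the full input length (12 here), which is the unbounded-memory behaviour the abstract describes as limitation (i).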

Paper Structure

This paper contains 46 sections, 3 equations, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: KV Cache Management Analysis. (a) Post prefill eviction: eviction after full-context prefill, reducing KV size at decoding but causing unbounded memory usage. (b) Block prefill eviction: input processed in 3-token blocks with patched prompts for scoring, then evicted to 1 token. (c) Top: Peak GPU memory vs. input length on LLaMA-3.2-3B with A100. Bottom: LongConvQA accuracy of KV compression methods under post vs. block prefill on LLaMA-3.2-3B.
  • Figure 2: Patched-prompt analysis: LoCoMo results with LLaMA-3.1-8B under block prefill. Patched prompts are formed by selecting the top 10%–90% similar conversation utterances to $q_i$.
  • Figure 3: LongConvQA Evaluation Results (Realtalk, LoCoMo, and LongMemEval) with a fixed KV cache budget of size $M$ across four LLMs. The number of episodes (clusters) is fixed to $E{=}4$ in all experiments. The average full KV lengths of the three benchmarks are 26K, 21K, and 20K.
  • Figure 4: Memory Scalability up to 100K Context. Conversation histories between user and LLM-based assistant scaled to 100K tokens across four LLMs with LongMemEval. Comparison of InfiniPot and KVzip ($M{=}6$K) with EpiCache (4 episodes, $M{=}6$K–24K).
  • Figure 5: Efficiency Analysis in Multi-Turn Conversation: (a) Per-turn decoding latency and peak GPU memory for full KV (100K) and EpiCache ($E{=}4$) with LLaMA-3.2-3B. Query Embed and Match: query encoding and centroid matching; KVs Retrieve: loading the episodic cache from CPU to GPU memory. (b) Cumulative episode switches in Realtalk with $E{=}4$, showing how often the active episode changes over a multi-turn conversation.
  • ...and 5 more figures
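
The "Query Embed and Match" step named in the Figure 5 caption (query encoding followed by centroid matching) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names `cosine` and `match_episode`, the use of cosine similarity, the episode labels, and the 3-dimensional toy vectors are hypothetical and stand in for whatever encoder and similarity measure the released code actually uses.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_episode(query_vec: Sequence[float],
                  centroids: Dict[str, Sequence[float]]) -> str:
    """Return the name of the episode whose centroid is most similar to the query."""
    return max(centroids, key=lambda name: cosine(query_vec, centroids[name]))


# Hypothetical centroids for E = 4 episodes; real embeddings would be much larger.
centroids = {
    "travel": [1.0, 0.0, 0.0],
    "work": [0.0, 1.0, 0.0],
    "family": [0.0, 0.0, 1.0],
    "hobbies": [0.6, 0.0, 0.8],
}
query_vec = [0.1, 0.9, 0.2]  # toy embedding of the incoming question
episode = match_episode(query_vec, centroids)
print(episode)
```

Under this reading, the selected episode name would decide which episode-specific cache is loaded for decoding, and a change in the selected name between consecutive turns would count as one episode switch in panel (b).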