RefreshKV: Updating Small KV Cache During Long-form Generation

Fangyuan Xu, Tanya Goyal, Eunsol Choi

TL;DR

RefreshKV introduces a dynamic inference scheme for long-context LLMs that keeps the full KV cache but decodes most tokens against a smaller partial cache, which it periodically refreshes, to speed up long-form generation. By selecting when to refresh and which tokens to keep based on attention patterns and query similarity, it maintains accuracy while achieving speedups comparable to eviction-based methods. Across two open models and multiple long-form benchmarks, RefreshKV mitigates failures of eviction strategies on tasks requiring long outputs and even enables new capabilities such as longer chain-of-key generation. Continued pretraining with RefreshKV further improves perplexity and efficiency, highlighting its practical potential for real-world long-context generation tasks.

Abstract

Generating long sequences of tokens given a long-context input is a very compute-intensive inference scenario for large language models (LLMs). One prominent inference speed-up approach is to construct a smaller key-value (KV) cache, relieving LLMs from computing attention over a long sequence of tokens. While such methods work well to generate short sequences, their performance degrades rapidly for long-form generation. Most KV compression happens once, prematurely removing tokens that can be useful later in the generation. We propose a new inference method, RefreshKV, that flexibly alternates between full context attention and attention over a subset of input tokens during generation. After each full attention step, we update the smaller KV cache based on the attention pattern over the entire input. Applying our method to off-the-shelf LLMs achieves comparable speedup to eviction-based methods while improving performance for various long-form generation tasks. Lastly, we show that continued pretraining with our inference setting brings further gains in performance.

Paper Structure

This paper contains 53 sections, 4 figures, 14 tables.

Figures (4)

  • Figure 1: Left: Illustration of RefreshKV (with $L=5$, $K=3$ and a stride $S=3$) compared to the baselines (SnapKV and Full KV) when generating four tokens. The figure shows the computational complexity of the attention operation and the size of the KV cache used at each decoding step for each method. Our approach alternates between inference with the partial cache (t=1, 2, 4) and the full cache (t=3). Compared to eviction-based methods (e.g., SnapKV), which completely discard the evicted tokens, RefreshKV updates the partial cache based on attention scores over the entire context during the full attention steps. Right: An example of the chain-of-key task and the performance of RefreshKV and the baselines. RefreshKV maintains performance across different chain lengths, while the eviction-based baselines' performance degrades when generating a chain with more than one key.
  • Figure 2: Pseudocode for RefreshKV. The model prefills the prompt with full attention and initializes the partial cache $C_{p}$ using the attention scores of the last token. For each partial attention step, we decode with the partial cache and append the KV pairs of the input token to the partial cache. We evict the token with the lowest attention score to maintain a fixed-size partial cache. For the full attention step, we first update the full KV cache with the new tokens decoded with the partial cache, then decode with the full cache and refresh the partial cache.
  • Figure 3: We plot the perplexity ratio relative to the vanilla baseline for RefreshKV (with a stride of 10) and SnapKV as a function of the number of tokens generated (x-axis). While the ratio is similar at the beginning of the sequence, as generation proceeds SnapKV's perplexity diverges from that of the vanilla approach, while that of RefreshKV remains relatively stable.
  • Figure 4: Effective stride across layers for Llama-3.1-8B (similarity threshold=0.85) and Qwen2-7B (similarity threshold=0.95) on three datasets. We sample 10 examples from each dataset to estimate the effective stride.
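
The decoding loop described in the Figure 2 caption can be sketched as a toy, single-head NumPy simulation. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `attend`, `top_k` and `refreshkv_decode` are hypothetical, the stride is fixed (the query-similarity criterion behind the effective stride in Figure 4 is not modeled), the last prompt key stands in for the last prompt token's query during prefill, and the KV pairs of tokens decoded in partial steps are simply buffered and appended to the full cache at the next full step.

```python
import numpy as np

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    logits = keys @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values, weights

def top_k(weights, k):
    """Indices of the k largest weights, returned in original order."""
    return np.sort(np.argsort(weights)[-k:])

def refreshkv_decode(prompt_k, prompt_v, steps, K, S):
    """Toy RefreshKV loop with a fixed stride S and partial cache size K.

    `steps` holds one (q, k, v) triple per decoded token. Returns a list of
    (mode, number of cache entries attended over) for each decoding step.
    """
    full_k, full_v = prompt_k.copy(), prompt_v.copy()
    # Prefill: initialize the partial cache from the last token's attention
    # scores (assumption: its key is reused as a stand-in query).
    _, w = attend(prompt_k[-1], full_k, full_v)
    keep = top_k(w, K)
    part_k, part_v = full_k[keep], full_v[keep]
    pending = []  # KV pairs decoded with the partial cache, not yet in the full cache
    trace = []
    for t, (q, k, v) in enumerate(steps, start=1):
        if t % S == 0:
            # Full step: first bring the full cache up to date ...
            full_k = np.vstack([full_k, *[pk for pk, _ in pending], k])
            full_v = np.vstack([full_v, *[pv for _, pv in pending], v])
            pending = []
            # ... then decode with it and refresh the partial cache.
            _, w = attend(q, full_k, full_v)
            keep = top_k(w, K)
            part_k, part_v = full_k[keep], full_v[keep]
            trace.append(("full", len(full_k)))
        else:
            # Partial step: append the input token's KV pair and decode.
            part_k = np.vstack([part_k, k])
            part_v = np.vstack([part_v, v])
            _, w = attend(q, part_k, part_v)
            trace.append(("partial", len(part_k)))
            # Evict the lowest-scoring entry to keep the cache at size K.
            drop = int(np.argmin(w))
            part_k = np.delete(part_k, drop, axis=0)
            part_v = np.delete(part_v, drop, axis=0)
            pending.append((k, v))
    return trace

# Settings from Figure 1: L=5 prompt tokens, K=3, S=3, four generated tokens.
rng = np.random.default_rng(0)
d, L, K, S = 8, 5, 3, 3
prompt_k, prompt_v = rng.normal(size=(L, d)), rng.normal(size=(L, d))
steps = [tuple(rng.normal(size=(3, d))) for _ in range(4)]
print(refreshkv_decode(prompt_k, prompt_v, steps, K, S))
# [('partial', 4), ('partial', 4), ('full', 8), ('partial', 4)]
```

With these settings the sketch reproduces the schedule in Figure 1: partial-cache steps at t=1, 2, 4 and a full-cache step at t=3. The partial steps attend over K+1 entries because, in this sketch, the input token's own KV pair is appended before attention and the eviction happens afterwards.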