FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation

Dongwon Jo, Jiwon Song, Yulhwa Kim, Jae-Joon Kim

TL;DR

FastKV addresses the two costs of long-context inference, prefill computation and KV cache retention, by decoupling them through Token-Selective Propagation (TSP). It uses a two-stage prefill: early layers process the full context, and the layers after the TSP layer process only the salient tokens that are propagated. Per-layer KV compression is applied alongside this to accelerate decoding. Because the TSP rate and the KV retention rate are tuned independently, FastKV delivers up to $1.82\times$ prefill and $2.87\times$ decoding speedups with accuracy comparable to full-context baselines on long-context benchmarks. This makes it relevant to real-time reasoning, retrieval-augmented generation, and code understanding over very long input contexts.

Abstract

While large language models (LLMs) excel at handling long-context sequences, they require substantial prefill computation and key-value (KV) cache, which can heavily burden computational efficiency and memory usage in both prefill and decoding stages. Recent works that compress KV caches with prefill acceleration reduce this cost but inadvertently tie the prefill compute reduction to the decoding KV budget. This coupling arises from overlooking the layer-dependent variation of critical context, often leading to accuracy degradation. To address this issue, we introduce FastKV, a KV cache compression framework designed to reduce latency in both prefill and decoding by leveraging the stabilization of token importance in later layers. FastKV performs full-context computation until a Token-Selective Propagation (TSP) layer, which forwards only the most informative tokens to subsequent layers. From these propagated tokens, FastKV independently selects salient KV entries for caching, thereby decoupling KV budget from the prefill compute reduction based on the TSP decision. This independent control of the TSP rate and KV retention rate enables flexible optimization of efficiency and accuracy. Experimental results show that FastKV achieves speedups of up to 1.82$\times$ in prefill and 2.87$\times$ in decoding compared to the full-context baseline, while matching the accuracy of the baselines that only accelerate the decoding stage. Our code is available at https://github.com/dongwonjo/FastKV.
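
The control flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `importance` array (a stand-in for per-layer token saliency, which could for instance be derived from attention scores), and the way both rates are expressed as fractions of the input length are all assumptions made for illustration. The sketch only shows that the number of tokens propagated past the TSP layer and the number of KV entries kept per layer are set by two separate parameters.

```python
import numpy as np


def top_k_indices(scores, k):
    """Return the positions of the k largest scores, sorted by position."""
    k = min(k, scores.shape[0])
    idx = np.argpartition(scores, -k)[-k:]
    return np.sort(idx)


def fastkv_like_prefill(importance, tsp_layer, tsp_rate, kv_rate):
    """Toy model of a two-stage prefill with independent KV retention.

    importance: (num_layers, seq_len) array of hypothetical saliency scores.
    tsp_layer:  index of the layer after which only selected tokens propagate.
    tsp_rate:   fraction of input tokens propagated past the TSP layer.
    kv_rate:    fraction of input tokens whose KV entries are kept per layer.
    Returns (kept, processed): kept[l] holds the original token positions
    cached at layer l; processed[l] is the token count layer l computes on.
    """
    num_layers, seq_len = importance.shape
    active = np.arange(seq_len)  # tokens still flowing through the model
    kv_budget = max(1, int(round(kv_rate * seq_len)))
    kept, processed = [], []
    for layer in range(num_layers):
        processed.append(active.size)
        scores = importance[layer, active]
        # KV retention: choose cache entries among the tokens seen here.
        kept.append(active[top_k_indices(scores, kv_budget)])
        # Token-selective propagation: shrink the active set exactly once.
        if layer == tsp_layer:
            n_prop = max(1, int(round(tsp_rate * seq_len)))
            active = active[top_k_indices(scores, n_prop)]
    return kept, processed


rng = np.random.default_rng(0)
importance = rng.random((8, 100))  # random scores, for illustration only
kept, processed = fastkv_like_prefill(importance, tsp_layer=3,
                                      tsp_rate=0.2, kv_rate=0.1)
print(processed)
print([len(k) for k in kept])
```

In this toy setting, changing `tsp_rate` alters how many tokens the later layers compute on without changing the per-layer cache size, and changing `kv_rate` does the reverse, which is the decoupling the abstract describes.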

Paper Structure

This paper contains 31 sections, 3 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: (a) Early layers exhibit unstable context focus, reflected by low critical token overlap. (b) Attention distributions are sparse, with Top-K tokens dominating the scores.
  • Figure 2: Illustration of the proposed FastKV scheme. FastKV introduces a Token-Selective Propagation approach that propagates only a limited set of tokens while effectively compressing the KV cache.
  • Figure 3: Comparison of normalized L2 distances between hidden states generated by the full-context baseline, TSP, and GemFilter-like methods.
  • Figure 4: End-to-end inference latency breakdown of LLaMA-3.1-8B-Instruct at varying input context lengths (generating 256 tokens).
  • Figure 5: (a) Effect of TSP rate on LongBench average accuracy and prefill latency. (b) Effect of TSP layer index on LongBench average accuracy and prefill latency.
  • ...and 4 more figures
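
The two quantities named in the Figure 1 caption, critical token overlap and the dominance of Top-K tokens in the attention scores, can be made concrete with a small sketch. The exact definitions used in the paper are not given in this listing, so the formulas below are assumptions: overlap is taken as the fraction of Top-K token positions shared by two layers, and dominance as the share of total attention mass held by the Top-K tokens. The function names are hypothetical.

```python
import numpy as np


def critical_token_overlap(scores_a, scores_b, k):
    """Fraction of the top-k token positions shared by two score vectors."""
    top_a = set(np.argsort(scores_a)[-k:].tolist())
    top_b = set(np.argsort(scores_b)[-k:].tolist())
    return len(top_a & top_b) / k


def top_k_mass(attn, k):
    """Share of the total attention mass held by the k largest entries."""
    attn = np.asarray(attn, dtype=float)
    return float(np.sort(attn)[-k:].sum() / attn.sum())


# Toy per-token scores for two layers (made-up values).
layer_a = np.array([0.1, 0.9, 0.8, 0.2])
layer_b = np.array([0.9, 0.8, 0.1, 0.2])
print(critical_token_overlap(layer_a, layer_b, k=2))
print(top_k_mass([0.7, 0.2, 0.05, 0.05], k=2))
```

Under these assumed definitions, a low overlap between adjacent early layers would correspond to the unstable context focus in panel (a), and a Top-K mass close to 1 would correspond to the sparsity in panel (b).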