FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management

Nazmul Takbir, Hamidreza Alikhani, Nikil Dutt, Sangeetha Abdu Jyothi

TL;DR

This work addresses the growing GPU memory and compute burden of KV caches in long-context LLM serving by exploiting differences in temporal stability across attention heads. The authors classify KV heads as stable or unstable and implement FlexiCache, a stability-aware hierarchical KV-cache system. FlexiCache keeps the full KV cache of unstable heads on the GPU; for stable heads, it keeps only the top-K pages on the GPU and offloads the rest to host memory, with periodic, query-aware reranking to fetch newly promoted pages. The approach uses sparse attention over top-K pages, a MinMax-based page scoring mechanism, and efficient data-transfer and block-table techniques to minimize I/O and fragmentation, while maintaining near-dense-attention accuracy (≈100% retention). Reported improvements are up to 70% lower GPU memory usage, 1.38–1.55× higher offline throughput, and 1.6–2.1× lower online token latency. Implemented on vLLM, FlexiCache performs well across long-context and long-generation tasks, which suggests practical value for scalable, low-latency LLM serving.

Abstract

Large Language Model (LLM) serving is increasingly constrained by the growing size of the key-value (KV) cache, which scales with both context length and generation length. Prior work shows that attention is dominated by a small subset of critical tokens, yet existing systems struggle to exploit this efficiently without degrading accuracy, especially in long generation. We make a key observation: the temporal stability of these critical tokens varies significantly across KV heads: some heads consistently focus on the same tokens, while others shift frequently. Building on this insight, we introduce FlexiCache, a hierarchical KV-cache management system that leverages the temporal stability of KV heads to reduce GPU memory usage and computation overhead, while preserving model accuracy. FlexiCache classifies KV heads as stable or unstable: it retains all KV-cache pages from unstable heads in GPU memory, whereas for stable heads, it keeps only the top-K pages on the GPU and offloads the rest to host memory. By exploiting temporal stability, FlexiCache performs periodic reranking for stable heads to fetch newly promoted top pages. Implemented atop vLLM, FlexiCache reduces GPU memory footprint for long-context requests by up to 70%, improves offline serving throughput by 1.38-1.55x, and lowers online token latency by 1.6-2.1x, all while maintaining accuracy in long-context, long-generation scenarios.
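The page selection step can be illustrated with a minimal sketch. The summary mentions a MinMax-based page scoring mechanism but does not spell it out, so the code below assumes a common formulation: each page stores the per-dimension minimum and maximum of its keys, and a page's score is an upper bound on the query-key dot product for any key in that page. The function names (`page_min_max`, `page_score`, `top_k_pages`) are hypothetical and are not taken from FlexiCache or vLLM.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def page_min_max(keys: Sequence[Vector]) -> Tuple[List[float], List[float]]:
    """Per-dimension min and max over the key vectors stored in one page."""
    dims = range(len(keys[0]))
    mins = [min(k[d] for k in keys) for d in dims]
    maxs = [max(k[d] for k in keys) for d in dims]
    return mins, maxs


def page_score(query: Vector, mins: Vector, maxs: Vector) -> float:
    """Upper bound on q.k for any key in the page.

    Assumption: this min/max bound stands in for the paper's MinMax scoring,
    whose exact definition is not given in the summary.
    """
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, mins, maxs))


def top_k_pages(query: Vector, pages: Sequence[Sequence[Vector]], k: int) -> List[int]:
    """Return the indices of the k highest-scoring pages, in ascending page order."""
    scores = [page_score(query, *page_min_max(page)) for page in pages]
    ranked = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Three pages of two 2-dimensional keys each (made-up values for illustration).
pages = [
    [[1.0, 0.0], [0.5, 0.2]],
    [[-1.0, 0.0], [-2.0, 1.0]],
    [[0.0, 3.0], [0.1, 2.0]],
]
query = [1.0, 0.5]
selected = top_k_pages(query, pages, k=2)
```

In a real system the min/max summaries would be computed once per page and cached rather than recomputed for every query; the recomputation here only keeps the sketch short.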

Paper Structure

This paper contains 23 sections, 3 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Temporal stability patterns of KV heads for Llama-3.1-8B-Instruct, layer 4. Some heads maintain high RCO across offsets, while others show persistently low values.
  • Figure 2: FlexiCache system architecture. At the worker, the top-K selector identifies the most relevant KV pages for each head, updating them at different frequencies based on head stability. The sparse decode kernel attends only to these selected pages. GPU memory stores the full KV cache of unstable heads and only the top-K pages of stable heads, with the rest in host memory. The block allocator manages this hierarchical KV layout, while the KV transfer module and scheduler pipeline host–GPU KV transfers with computation.
  • Figure 3: Hierarchical KV-cache placement for a request with four logical KV pages running on a two-layer, two-head model.
  • Figure 4: FlexiCache Pipeline. KV offloading is overlapped with computation of the same request, while KV reloading is overlapped with computation of other requests in the batch.
  • Figure 5: End-to-end throughput. FlexiCache consistently outperforms vLLM on both Llama-3.1-8B and Mistral-7B in token throughput, with gains increasing as output length grows. Similar improvements are observed for request throughput with an output length of 500.
  • ...and 5 more figures
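
The hierarchical placement described in the captions of Figures 2 and 3 can be sketched as follows. This is a simplified model under stated assumptions, not the authors' block allocator: `HeadPlacement`, `place_pages`, `should_rerank`, and `pages_to_fetch` are hypothetical names, and the rerank schedule (unstable heads reselect every decode step, stable heads every `stable_interval` steps) is an assumed reading of "updating them at different frequencies based on head stability".

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass
class HeadPlacement:
    gpu_pages: Set[int]   # logical page ids resident in GPU memory
    host_pages: Set[int]  # logical page ids offloaded to host memory


def place_pages(num_pages: int, stable: bool, top_k: Iterable[int]) -> HeadPlacement:
    """Unstable heads keep every page on the GPU; stable heads keep only top-K."""
    all_pages = set(range(num_pages))
    if not stable:
        return HeadPlacement(gpu_pages=all_pages, host_pages=set())
    gpu = set(top_k)
    return HeadPlacement(gpu_pages=gpu, host_pages=all_pages - gpu)


def should_rerank(step: int, stable: bool, stable_interval: int) -> bool:
    """Assumed schedule: unstable heads every step, stable heads periodically."""
    return (not stable) or step % stable_interval == 0


def pages_to_fetch(current: HeadPlacement, new_top_k: Iterable[int]) -> Set[int]:
    """Newly promoted pages that must be transferred from host to GPU."""
    return set(new_top_k) - current.gpu_pages


# A request with four logical KV pages, as in the Figure 3 caption.
unstable_head = place_pages(4, stable=False, top_k=[1, 3])
stable_head = place_pages(4, stable=True, top_k=[1, 3])
promoted = pages_to_fetch(stable_head, new_top_k=[1, 2])
```

Only the difference between the old and new top-K sets crosses the host-GPU link after a rerank, which is why infrequent reranking for stable heads keeps transfer volume low.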