BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference

Ahmed Burak Gulhan, Krishna Teja Chitty-Venkata, Murali Emani, Mahmut Kandemir, Venkatram Vishwanath

TL;DR

BaKlaVa tackles the memory bottleneck of KV-caches in long-context LLM inference by allocating memory budgets to individual KV-caches based on a one-time profiling of attention-head importance. It combines a Head Importance Heuristic, Grouped Query Attention handling, and a Layer Importance proxy to estimate the significance of each cache, then runs a parameter-search-based budget allocation that favors high-importance caches and shrinks low-importance ones. The approach is evaluated on LLaMA-3-8B and Qwen2.5-7B using LongBench, reaching up to a 70% compression ratio while keeping baseline performance, and it is implemented as a HuggingFace KV-cache object for practical deployment. The findings indicate that fine-grained, per-cache memory budgeting can improve long-context inference efficiency without model tuning, giving near-baseline results at high compression and offering a scalable path for diverse architectures.

Abstract

In Large Language Model (LLM) inference, Key-Value (KV) caches (KV-caches) are essential for reducing time complexity. However, they result in a linear increase in GPU memory as the context length grows. While recent work explores KV-cache eviction and compression policies to reduce memory usage, these policies often treat KV-caches uniformly across all attention heads, leading to suboptimal performance. We introduce BaKlaVa, a method to allocate optimal memory for individual KV-caches across the model by estimating the importance of each KV-cache. Our empirical analysis demonstrates that not all KV-caches are equally critical for LLM performance. Using a one-time profiling approach, BaKlaVa assigns optimal memory budgets to each KV-cache. We evaluated our method on the LLaMA-3-8B and Qwen2.5-7B models, achieving up to a 70% compression ratio while keeping baseline performance and delivering up to an order-of-magnitude accuracy improvement at higher compression levels.
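
To make the linear memory growth concrete, the following is a minimal back-of-the-envelope sketch, not code from the paper. The function name `kv_cache_bytes` is hypothetical, and the defaults assume the published LLaMA-3-8B configuration (32 layers, 8 KV heads under Grouped Query Attention, head dimension 128) with 16-bit cache entries and no compression.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8, head_dim=128,
                   bytes_per_elem=2, batch_size=1):
    """Estimate the size of an uncompressed KV-cache in bytes.

    Defaults are assumptions matching the LLaMA-3-8B configuration with
    fp16/bf16 cache entries; they are not taken from the paper.
    """
    # The leading 2 accounts for storing both keys and values.
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_elem)


if __name__ == "__main__":
    for tokens in (1, 8192, 32768):
        size_mib = kv_cache_bytes(tokens) / 2**20
        print(f"{tokens:>6} tokens -> {size_mib:.3f} MiB")
```

Under these assumptions every additional token costs the same fixed number of bytes, so doubling the context length doubles the cache size. This is the growth that per-cache budgeting aims to limit.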

Paper Structure

This paper contains 29 sections, 7 equations, 7 figures, 4 algorithms.

Figures (7)

  • Figure 1: The attention-head similarity heuristic used in BaKlaVa. By taking the cosine similarity between the input and output, we can calculate how much change there is. The more change between the input and output of the attention head, the more important we assume it is.
  • Figure 2: Cosine similarity heatmap for input and output of attention heads for two different prompts in LLaMA3-8B and Qwen2.5-7B. We chose three representative layers to illustrate that attention head consistency holds across different prompts. The X-axis shows the attention heads in a layer, and the Y-axis shows each token position in the prompt. Green and red outlines show the highest and lowest column similarity means per layer, that is, the most and least important attention heads respectively. The order of average attention head similarities (the mean of each column, see the profiling algorithm) stays highly consistent even across different prompts of different lengths, indicating that profiling an LLM one time is sufficient to make KV-cache importance estimations.
  • Figure 3: Comparison of BaKlaVa (varying memory budgets for both layers and KV-caches), SqueezeAttention (varying memory budgets for layers), and StreamingLLM (uniform memory budget for all KV-caches) on different LongBench tasks under various compression settings. The LongBench datasets shown include few-shot learning (triviaqa), coding (repobench-p), multi-document question answering (2wikimqa), and summarization (gov_report). For BaKlaVa and SqueezeAttention, we conducted a parameter search using perplexity as a benchmark to determine the optimal settings for each compression ratio (detailed in the parameter search section).
  • Figure 4: Comparison of layer importance heuristics with empirical evaluation results. Layers identified as least important by both heuristic and empirical test scores are highlighted in green, while critical layers are marked in red. LLaMA3-8B and Mistral-7B-v0.1 exhibit strong alignment between heuristic predictions and empirical findings, whereas Qwen2.5-7B shows significant discrepancies. Consequently, applying layer-wise KV-cache memory allocation based on the heuristic to Qwen2.5-7B may result in performance degradation.
  • Figure 5: Comparison of BaKlaVa and other cache methods on LLaMA3-8B using LongBench for different compression ratios.
  • ...and 2 more figures
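
The similarity heuristic described in the captions of Figures 1 and 2 can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the paper's profiling algorithm. The function names, the nested-list layout `inputs[token][head]`, and the choice of `1 - mean similarity` as the importance score are all hypothetical. The captions only state that lower input/output similarity is taken to indicate a more important head, and that similarities are averaged per head over token positions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def head_importance(inputs, outputs):
    """Score each attention head in one layer.

    inputs[t][h] and outputs[t][h] are the input and output vectors of
    head h at token position t (assumed layout). For each head, the
    similarities are averaged over tokens (the column mean in Figure 2)
    and converted to an importance score as 1 - mean similarity, which
    is an illustrative assumption.
    """
    num_tokens = len(inputs)
    num_heads = len(inputs[0])
    scores = []
    for h in range(num_heads):
        mean_sim = sum(
            cosine_similarity(inputs[t][h], outputs[t][h])
            for t in range(num_tokens)
        ) / num_tokens
        scores.append(1.0 - mean_sim)
    return scores


if __name__ == "__main__":
    # Toy data: 2 tokens, 2 heads, 2-dimensional vectors.
    # Head 0 only rescales its input; head 1 rotates it by 90 degrees.
    toy_inputs = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [0.0, 2.0]]]
    toy_outputs = [[[2.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [3.0, 0.0]]]
    print(head_importance(toy_inputs, toy_outputs))
```

In the toy data, head 0 leaves the direction of its input unchanged and so scores as unimportant, while head 1 changes it completely and scores as important. Because the captions report that the per-head ordering is stable across prompts, such scores would only need to be computed once per model.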