BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference
Ahmed Burak Gulhan, Krishna Teja Chitty-Venkata, Murali Emani, Mahmut Kandemir, Venkatram Vishwanath
TL;DR
BaKlaVa tackles the memory bottleneck of KV-caches in long-context LLM inference by allocating a separate memory budget to each KV-cache, based on a one-time profiling of attention head importance. It combines a Head Importance Heuristic, Grouped Query Attention handling, and a Layer Importance proxy to estimate the significance of each cache, then runs a parameter-search-based budget allocation that gives more memory to high-importance caches and shrinks low-importance ones. The approach is evaluated on LLaMA-3-8B and Qwen2.5-7B using LongBench, reaching up to a 70% compression ratio while keeping baseline performance, and it is implemented as a HuggingFace KV-cache object for practical deployment. The findings indicate that fine-grained, per-cache memory budgeting can improve long-context inference efficiency without model tuning, giving near-baseline results at high compression.
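To make the idea of importance-weighted budgeting concrete, the following is a minimal sketch that splits a fixed token budget across caches in proportion to their importance scores. It is not the parameter-search procedure used by BaKlaVa: the function name `allocate_budgets`, the `min_budget` floor, and the proportional rule are all assumptions chosen for illustration.

```python
def allocate_budgets(importances, total_budget, min_budget=1):
    """Split `total_budget` cache slots across KV-caches.

    Hypothetical illustration only: a simple proportional split,
    not the parameter search described for BaKlaVa.
    """
    n = len(importances)
    if total_budget < n * min_budget:
        raise ValueError("total_budget is too small for the per-cache minimum")

    # Every cache first receives the minimum; the rest is shared by importance.
    spare = total_budget - n * min_budget
    total_importance = sum(importances)
    if total_importance <= 0:
        shares = [spare / n] * n
    else:
        shares = [spare * s / total_importance for s in importances]

    budgets = [min_budget + int(share) for share in shares]

    # Hand out slots lost to rounding, largest fractional remainder first.
    leftover = max(0, total_budget - sum(budgets))
    by_remainder = sorted(
        range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True
    )
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets


# Three caches with made-up importance scores sharing 100 slots.
example = allocate_budgets([5, 3, 2], total_budget=100, min_budget=10)
print(example)
```

The floor keeps any single cache from being emptied entirely, which is a design choice of this sketch rather than a documented property of the method.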
Abstract
In Large Language Model (LLM) inference, Key-Value caches (KV-caches) are essential for reducing time complexity. However, their GPU memory footprint grows linearly with the context length. While recent work explores KV-cache eviction and compression policies to reduce memory usage, these policies often treat KV-caches uniformly across all attention heads, leading to suboptimal performance. We introduce BaKlaVa, a method to allocate optimal memory for individual KV-caches across the model by estimating the importance of each KV-cache. Our empirical analysis demonstrates that not all KV-caches are equally critical for LLM performance. Using a one-time profiling approach, BaKlaVa assigns optimal memory budgets to each KV-cache. We evaluated our method on the LLaMA-3-8B and Qwen2.5-7B models, achieving up to a 70% compression ratio while keeping baseline performance and delivering up to an order-of-magnitude accuracy improvement at higher compression levels.
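The linear growth of KV-cache memory with context length can be illustrated with a short calculation. The sketch below assumes the commonly published LLaMA-3-8B configuration (32 layers, 8 key-value heads under Grouped Query Attention, head dimension 128) and 16-bit storage; these values and the helper name `kv_cache_bytes` are assumptions, not figures taken from the text above.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to store keys and values for every layer and KV head."""
    keys_and_values = 2  # one tensor for keys, one for values
    return (keys_and_values * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Assumed LLaMA-3-8B-style configuration with fp16 cache entries.
ASSUMED_CONFIG = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(seq_len=seq_len, **ASSUMED_CONFIG) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.1f} GiB")
```

Under these assumptions each token costs 128 KiB of cache, so quadrupling the context quadruples the memory, which is the growth that per-cache budgeting aims to curb.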
