CoKV: Optimizing KV Cache Allocation via Cooperative Game

Qiheng Sun, Hongwei Zhang, Haocheng Xia, Jiayao Zhang, Jinfei Liu, Kui Ren

TL;DR

CoKV addresses the KV-cache memory bottleneck in long-context LLM inference by treating attention-head contributions as a cooperative game and estimating head importance with a Sliced Shapley Value (SSV). The method allocates per-head KV cache budgets based on normalized cooperative contributions and uses SnapKV for efficient eviction within each head's cache. On LongBench with Llama-3-8B-Instruct and Mistral-7B, CoKV achieves state-of-the-art results, retaining near-full performance with significantly reduced memory and latency. The approach is compatible with modern inference optimizations such as GQA and Flash Attention, offering a scalable solution for resource-constrained, long-context applications.
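
The allocation step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper name `allocate_head_budgets`, the min-shift normalization, the rounding rule, and the toy scores are all assumptions made for illustration. The summary only states that budgets follow normalized contributions and that every head keeps a small local window.

```python
def allocate_head_budgets(scores, total_budget, local_window):
    """Split `total_budget` KV slots across heads (hypothetical helper).

    Every head first receives `local_window` slots; the remaining slots are
    shared in proportion to the shifted, normalized importance scores.
    """
    n = len(scores)
    shared = total_budget - n * local_window
    if shared < 0:
        raise ValueError("total_budget is too small for the local windows")

    # Assumed normalization: shift so the weakest head gets weight 0,
    # then scale the weights so they sum to 1.
    low = min(scores)
    shifted = [s - low for s in scores]
    total = sum(shifted)
    weights = [x / total for x in shifted] if total > 0 else [1.0 / n] * n

    # Truncate fractional slots, then return the slots lost to truncation
    # to the highest-weight heads so the total budget is used exactly.
    budgets = [local_window + int(w * shared) for w in weights]
    leftover = max(total_budget - sum(budgets), 0)
    by_weight = sorted(range(n), key=lambda i: -weights[i])
    for i in by_weight[:leftover]:
        budgets[i] += 1
    return budgets


# Toy example: 4 heads with an average budget of 128 slots per head.
budgets = allocate_head_budgets([0.1, 0.4, 0.3, 0.2],
                                total_budget=4 * 128, local_window=32)
print(budgets)
```

In this sketch the lowest-scoring head keeps only its local window, while the remaining slots concentrate on higher-scoring heads. Within each head, an eviction policy such as SnapKV would then decide which tokens fill the allotted slots.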

Abstract

Large language models (LLMs) have achieved remarkable success across many aspects of human life. However, a major challenge in deploying these models is the substantial memory required to store key-value (KV) pairs, which imposes significant resource demands. Recent research has focused on KV cache budget allocation, with several approaches proposing head-level budget distribution by evaluating the importance of individual attention heads. These methods, however, assess the importance of heads independently, overlooking their cooperative contributions within the model, so the estimated importance may deviate from a head's true impact on model performance. In light of this limitation, we propose CoKV, a novel method that models the cooperation between heads during model inference as a cooperative game. By evaluating the contribution of each head within the cooperative game, CoKV can allocate the cache budget more effectively. Extensive experiments show that CoKV achieves state-of-the-art performance on the LongBench benchmark using the Llama-3-8B-Instruct and Mistral-7B models.

Paper Structure

This paper contains 34 sections, 1 theorem, 16 equations, 9 figures, 10 tables, 2 algorithms.

Key Result

Theorem 1

Algorithm Alg:approSV returns an $(\epsilon,\delta)$-approximation of the Sliced Shapley value with time complexity $\mathcal{O}\left(\frac{T|\mathcal{H}|\ln\frac{2|\mathcal{H}|}{\delta}}{\epsilon^2}\right)$, where $T$ is the time cost of evaluating a complementary contribution which is the time to inference on the v
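
To get a feel for how the bound in Theorem 1 scales, the expression inside the $\mathcal{O}(\cdot)$ can be evaluated without the per-evaluation cost $T$ and without the hidden constant. The helper name `ssv_cost_scale` and the example values (1024 heads, $\epsilon = 0.1$, $\delta = 0.05$) are assumptions chosen for illustration, so the result is only a relative scale, not a sample count from the paper.

```python
import math


def ssv_cost_scale(num_heads, epsilon, delta):
    """Evaluate |H| * ln(2|H| / delta) / epsilon^2 (hypothetical helper).

    This is the term inside the O(.) of Theorem 1 with T and the hidden
    constant left out, so only ratios between settings are meaningful.
    """
    if num_heads <= 0 or epsilon <= 0 or not 0 < delta < 1:
        raise ValueError("need num_heads > 0, epsilon > 0 and 0 < delta < 1")
    return num_heads * math.log(2 * num_heads / delta) / epsilon ** 2


# Assumed setting: 1024 heads, epsilon = 0.1, delta = 0.05.
base = ssv_cost_scale(1024, 0.1, 0.05)
tighter = ssv_cost_scale(1024, 0.05, 0.05)
print(tighter / base)  # ratio of costs when epsilon is halved
```

The dependence on $\epsilon$ is quadratic, so halving the error tolerance roughly quadruples the work, whereas tightening $\delta$ only adds a logarithmic factor.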

Figures (9)

  • Figure 1: Overview of our proposed method: (1) Head Importance Evaluation (Upper Part): For a 4-layer × 4-head model, we measure head importance using the Sliced Shapley Value (SSV). To approximate SSV, we sample $M$ different sets of masked heads and compute their complementary contributions. The average complementary contribution of each head is its estimated SSV. (2) KV Cache Compression (Lower Part): Using the 4 heads in Layer 3 as an example, all heads store KV pairs for a small local window of recent tokens, while heads with higher SSV (darker in the heatmap) are allocated a larger cache to retain KV pairs from before the local window.
  • Figure 2: Results for varying KV cache sizes (64, 128, 256, 512, 1024), showing the average accuracy across 16 datasets from the LongBench benchmark.
  • Figure 3: Results for varying numbers of masked groups (16, 32, 64, 96, 128), showing the average accuracy across 16 datasets from the LongBench benchmark.
  • Figure 4: Decoding latency and peak memory usage, demonstrating that CoKV performs comparably to the other baseline methods while achieving significant improvements over FullKV.
  • Figure 5: Heatmap of Llama-3-8B-Instruct.
  • ...and 4 more figures
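
The sampling procedure described in the Figure 1 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: the excerpt does not give the SSV definition, so the function `estimate_head_scores`, the restriction of masked-set sizes to `sizes`, the per-size averaging, and the additive toy `utility` are all hypothetical. In the paper's setting, the utility would be model performance measured with some heads masked.

```python
import random
from collections import defaultdict


def estimate_head_scores(heads, utility, num_samples, sizes, seed=0):
    """Monte Carlo head scores from complementary contributions (sketch)."""
    rng = random.Random(seed)
    n = len(heads)
    all_heads = frozenset(heads)
    sums = defaultdict(float)  # (head, coalition size) -> summed contribution
    counts = defaultdict(int)  # (head, coalition size) -> number of samples

    for _ in range(num_samples):
        k = rng.choice(sizes)                    # how many heads to mask
        masked = frozenset(rng.sample(heads, k))
        kept = all_heads - masked
        # Complementary contribution of the kept heads versus the masked ones.
        cc = utility(kept) - utility(masked)
        for h in kept:                           # credit heads that stayed on
            sums[(h, n - k)] += cc
            counts[(h, n - k)] += 1
        for h in masked:                         # opposite sign for masked heads
            sums[(h, k)] -= cc
            counts[(h, k)] += 1

    scores = {}
    for h in heads:
        # Average within each coalition size first, then across sizes.
        means = [sums[(h, s)] / counts[(h, s)]
                 for s in range(1, n + 1) if counts.get((h, s), 0) > 0]
        scores[h] = sum(means) / len(means) if means else 0.0
    return scores


# Toy additive utility: each head contributes a fixed, made-up amount.
toy_weights = {0: 0.1, 1: 0.4, 2: 0.3, 3: 0.2}
scores = estimate_head_scores(
    heads=[0, 1, 2, 3],
    utility=lambda coalition: sum(toy_weights[h] for h in coalition),
    num_samples=3000,
    sizes=[1, 2, 3],
)
print(sorted(scores, key=scores.get, reverse=True))  # heads by estimated score
```

Each sampled mask requires two utility evaluations but updates the score of every head, which is what makes complementary contributions attractive when a single evaluation means running inference.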

Theorems & Definitions (2)

  • Definition 1
  • Theorem 1