PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation

Ao Wang, Hui Chen, Jiaxin Li, Jianchao Tan, Kefeng Zhang, Xunliang Cai, Zijia Lin, Jungong Han, Guiguang Ding

TL;DR

This work addresses the high memory and compute costs of KV caches in large vision-language models (LVLMs) during autoregressive generation. It introduces PrefixKV, which recasts the choice of per-layer KV cache sizes as a search for a single global prefix configuration: KV vectors in each layer are ranked by importance, and a binary search over a threshold $p$ finds the configuration that meets a compression budget $r$ while retaining as much contextual information as possible, as measured by the prefix cumulative priorities $P_l^o$. Lorenz curves and Gini coefficients reveal that importance distributions differ across layers, which motivates adaptive, data-driven KV retention that outperforms uniform schemes on both LVLMs and LLMs; estimating the configuration offline from a small number of samples further reduces online overhead. Empirically, PrefixKV achieves state-of-the-art results, delivering substantial inference speedups (e.g., ~1.8x at a 20% budget) with minimal degradation in generation quality. It also demonstrates robustness, generalizability, and compatibility with merging and quantization strategies for broader deployment.

Abstract

Recently, large vision-language models (LVLMs) have rapidly gained popularity for their strong generation and reasoning capabilities given diverse multimodal inputs. However, these models incur significant computational and memory overhead during inference, which greatly hinders efficient deployment in practical scenarios. The extensive key-value (KV) cache, necessitated by lengthy input and output sequences, contributes notably to the high inference cost. Motivated by this, recent works have investigated ways to reduce the KV cache size for higher efficiency. Although effective, they generally overlook the distinct importance distributions of KV vectors across layers and maintain the same cache size for each layer during next-token prediction. This results in a significant loss of contextual information in certain layers, leading to a notable performance decline. To address this, we present PrefixKV, where "Prefix" means the top-ranked KV vectors based on importance rather than position in the original sequence. It reframes the challenge of determining KV cache sizes for all layers as the task of searching for the optimal global prefix configuration. With an adaptive layer-wise KV retention recipe based on binary search, the maximum contextual information can thus be preserved in each layer, facilitating generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance compared with others. It exhibits superior trade-offs between inference efficiency and generation quality, showing promising potential for practical applications. Code is available at https://github.com/THU-MIG/PrefixKV.
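
The search described above can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that); the function names, the use of plain Python lists of per-layer importance scores, and the fixed iteration count are all assumptions made for illustration. For a given threshold, each layer keeps the shortest prefix of its importance-ranked KV vectors whose cumulative priority reaches the threshold, and a binary search picks the largest threshold whose total retained size fits the budget.

```python
from bisect import bisect_left
from itertools import accumulate


def cumulative_priorities(scores):
    """Rank one layer's importance scores in descending order, normalize
    them to sum to 1, and return the running (prefix) sums."""
    total = sum(scores)
    ranked = sorted(scores, reverse=True)
    return list(accumulate(s / total for s in ranked))


def prefix_lengths(cum_per_layer, p):
    """Per layer, the shortest prefix whose cumulative priority reaches p."""
    lengths = []
    for cum in cum_per_layer:
        o = bisect_left(cum, p) + 1      # first index with cum[i] >= p, as a count
        lengths.append(min(o, len(cum)))  # guard against rounding below 1.0
    return lengths


def search_prefix_config(layer_scores, r, iters=50):
    """Binary-search the threshold p (hypothetical helper) so that the
    overall retained fraction of KV vectors is at most the budget r."""
    cum_per_layer = [cumulative_priorities(s) for s in layer_scores]
    total = sum(len(c) for c in cum_per_layer)
    lo, hi = 0.0, 1.0
    best = prefix_lengths(cum_per_layer, lo)  # at least one vector per layer
    for _ in range(iters):
        mid = (lo + hi) / 2
        lengths = prefix_lengths(cum_per_layer, mid)
        if sum(lengths) / total <= r:
            best, lo = lengths, mid  # feasible: try a higher threshold
        else:
            hi = mid                 # over budget: lower the threshold
    return lo, best


# Toy example: one layer with concentrated importance, one with uniform importance.
layer_scores = [[7, 1, 1, 1], [1, 1, 1, 1]]
p, lengths = search_prefix_config(layer_scores, r=0.5)
print(p, lengths)
```

In this toy case a uniform scheme would keep two vectors per layer, retaining 80% of the priority in the first layer but only 50% in the second. The searched configuration keeps one vector in the concentrated layer and three in the uniform one, so both layers retain at least 70%. Note that the sketch always keeps at least one vector per layer, so a budget smaller than that cannot be met.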

Paper Structure

This paper contains 27 sections, 4 equations, 11 figures, 22 tables, 1 algorithm.

Figures (11)

  • Figure 1: Comparison between previous methods and ours. Previous methods often keep the same prefix length for the priority sequences of KV, i.e., retaining the same cache size for each layer. This causes notable information loss for certain layers. In this example, the first layer loses 30% of its information. In contrast, we derive the optimal global prefix configuration to preserve as much information as possible in each layer. In this example, both layers retain 90% of their information, thereby enhancing performance.
  • Figure 2: The Lorenz curve of the priority sequence for KV vectors in different layers. We observe that different layers exhibit diverse importance distributions in the KV cache. Previous methods (the dashed black line), which keep the same prefix, cause notable information loss in layers with dispersed distributions. In contrast, our method (the dashed red line) maximally retains the contextual information of each layer by adaptively maintaining the maximal prefix cumulative priority. The numbers in parentheses in the legend give the Gini coefficient of the priority sequence in each layer. A higher Gini coefficient indicates a more concentrated importance distribution. This quantitatively demonstrates the varying importance distributions of KV vectors across layers.
  • Figure 3: (a) The inference process of LVLMs, where the orange and green rectangles denote the KV cache generated during prefilling and utilized during decoding, respectively. After prefilling, the KV cache is compressed layer by layer according to the proportions specified by PrefixKV, i.e., $\{\boldsymbol{R}_1, \ldots, \boldsymbol{R}_L\}$. During decoding, as the sequence lengthens and the cache grows, the KV cache consistently maintains the derived compression proportions by pruning KV at a fixed distance liu2024efficient. (b) The overview of PrefixKV. It employs binary search over the cumulative priority sequences of KV to derive the optimal global prefix configuration, which yields the ideal cache size ratio for each layer.
  • Figure 4: The retained KV cache size ratios for each layer of 100 random samples under a compression ratio of 50%, and the Gini coefficients of the priority sequences for KV vectors in each layer. Different samples exhibit similar and robust characteristics, which supports the use of offline estimation.
  • Figure 4: Prefix config. (Uncompressed: 5.28).
  • ...and 6 more figures
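
The Lorenz curves and Gini coefficients mentioned in the Figure 2 and Figure 4 captions can be computed as in the sketch below. It uses the standard textbook definitions (scores sorted in ascending order); the paper's exact computation and plotting convention are not given in this summary, so the function names and the sorting direction here are assumptions.

```python
def lorenz_curve(scores):
    """Cumulative share of total importance, with scores sorted ascending.
    Returns n + 1 points, starting at 0.0 and ending at 1.0."""
    ranked = sorted(scores)
    total = sum(ranked)
    points, running = [0.0], 0.0
    for s in ranked:
        running += s
        points.append(running / total)
    return points


def gini(scores):
    """Gini coefficient of non-negative scores: 0 for a perfectly uniform
    distribution, approaching 1 as importance concentrates in few entries."""
    ranked = sorted(scores)
    n, total = len(ranked), sum(ranked)
    weighted = sum(i * s for i, s in enumerate(ranked, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Toy priority sequences: a uniform layer and a concentrated layer.
print(gini([1, 1, 1, 1]), gini([7, 1, 1, 1]))
print(lorenz_curve([7, 1, 1, 1]))
```

The uniform layer has a Gini coefficient of 0, while the concentrated layer scores higher, matching the caption's reading that a higher value indicates a more concentrated importance distribution.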