On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference

Siyu Ren, Kenny Q. Zhu

TL;DR

The paper tackles the memory bottleneck of key-value caches in autoregressive LLM inference by analyzing eviction policies through two design dimensions: importance score calculation and eviction-scope construction. It reveals inconsistencies in prior methods and proposes RoCo, which uses Mean Attention Score to assess token importance and a standard-deviation-based eviction scope to robustly select evictable tokens. RoCo demonstrates superior downstream performance across language modeling, summarization, context reconstruction, and instruction following, especially under tight KV budgets, and scales well with larger models and grouped-query attention. The authors also release EasyKV, a practical library enabling KV-constrained inference with configurable budgets and eviction policies.
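The two design dimensions named above can be illustrated with a small sketch. This is not the authors' implementation and does not use the EasyKV API. The function names, the per-token `history` layout, and in particular the rule that the eviction scope consists of the tokens whose received attention has the largest standard deviation are assumptions made for illustration only.

```python
from statistics import pstdev


def mean_attention_scores(history):
    """Mean Attention Score per cached token.

    history[i] is the list of attention probabilities that cached token i
    has received from later queries (hypothetical data layout).
    """
    return [sum(h) / len(h) for h in history]


def eviction_scope(history, scope_size):
    """Assumed rule: the scope holds the scope_size tokens whose received
    attention varies the most (largest population standard deviation)."""
    stds = [pstdev(h) for h in history]
    ranked = sorted(range(len(history)), key=lambda i: stds[i], reverse=True)
    return sorted(ranked[:scope_size])


def pick_victim(history, scope_size):
    """Within the scope, evict the token with the lowest mean attention."""
    mas = mean_attention_scores(history)
    scope = eviction_scope(history, scope_size)
    return min(scope, key=lambda i: mas[i])


# Toy example: four cached tokens with different attention histories.
history = [[0.5, 0.5, 0.5], [0.1, 0.4], [0.3, 0.05], [0.2]]
victim = pick_victim(history, scope_size=2)
```

In this toy input, tokens 1 and 2 have the most variable attention and so form the scope, and token 2 is chosen because its mean attention is lower. Separating the scope from the score lets the two dimensions be swapped independently, which mirrors how the paper compares policies.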

Abstract

Despite the recent success of Large Language Models (LLMs), they are notably cost-prohibitive to deploy in resource-constrained environments due to their excessive memory and computational demands. In addition to model parameters, the key-value cache is also stored in GPU memory, growing linearly with batch size and sequence length. As a remedy, recent works have proposed various eviction policies for keeping the key-value cache overhead under a given budget. This paper examines the efficacy of existing eviction policies in terms of importance score calculation and eviction scope construction. We identify the deficiencies of prior policies in these two aspects and introduce RoCo, a robust cache omission policy based on temporal attention scores and robustness measures. Extensive experimentation spanning prefilling and auto-regressive decoding stages validates the superiority of RoCo. Finally, we release EasyKV, a versatile software package dedicated to user-friendly key-value constrained generative inference. Code available at https://github.com/DRSY/EasyKV.
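To make the linear growth with batch size and sequence length concrete, the following back-of-the-envelope sketch estimates KV cache size for a standard multi-head attention decoder. The function name and the model dimensions in the example (32 layers, 32 KV heads, head dimension 128, 16-bit values, roughly a 7B-parameter LLaMA-style configuration) are assumptions for illustration and are not taken from the paper.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Estimated KV cache size in bytes.

    The leading factor 2 accounts for storing both a key and a value vector
    per token, per layer, per KV head.
    """
    return (2 * batch_size * seq_len * num_layers
            * num_kv_heads * head_dim * bytes_per_value)


# Assumed 7B-style configuration, fp16 values, one sequence of 4096 tokens.
size = kv_cache_bytes(batch_size=1, seq_len=4096, num_layers=32,
                      num_kv_heads=32, head_dim=128)
size_gib = size / 2**30  # 2.0 GiB under these assumptions
```

Under these assumptions each token costs 0.5 MiB, so doubling either the batch size or the sequence length doubles the cache. A fixed KV budget caps `seq_len` in this formula, which is why an eviction policy is needed once generation exceeds the budget.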

Paper Structure (46 sections, 2 equations, 6 figures, 9 tables)

This paper contains 46 sections, 2 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Illustration of KV cache eviction inside one attention layer (of $L$ in total). In this example, a single pair of key-value vectors is deleted (red hatched areas) before the next token's pair is appended. Different heads ($H$ in total) across model layers may evict at different positions.
  • Figure 2: Consistency of different importance calculation methods w.r.t. their full KV cache variants.
  • Figure 3: Illustration of persistence of attention robustness. We extract attention scores and compute the standard deviation from LLaMa2-7B-Chat.
  • Figure 4: Results of MAS paired with local window and standard deviation on text summarization.
  • Figure 5: Performance of different eviction policies on the language modeling task based on LLaMa2-7B.
  • ...and 1 more figure