On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference

Siyu Ren; Kenny Q. Zhu

TL;DR

The paper tackles the memory bottleneck of key-value caches in autoregressive LLM inference by analyzing eviction policies through two design dimensions: importance score calculation and eviction-scope construction. It reveals inconsistencies in prior methods and proposes RoCo, which uses Mean Attention Score to assess token importance and a standard-deviation-based eviction scope to robustly select evictable tokens. RoCo demonstrates superior downstream performance across language modeling, summarization, context reconstruction, and instruction following, especially under tight KV budgets, and scales well with larger models and grouped-query attention. The authors also release EasyKV, a practical library enabling KV-constrained inference with configurable budgets and eviction policies.
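The two design dimensions named above can be made concrete with a small sketch. This is not the authors' implementation or the EasyKV API: the `TokenStats` class, the `choose_eviction` function, and the `protect_ratio` parameter are hypothetical names, and the rule that tokens with the highest standard deviation are kept out of the eviction scope is an assumption made for illustration.

```python
import math


class TokenStats:
    """Running attention statistics for one cached token (hypothetical helper)."""

    def __init__(self):
        self.count = 0        # number of decoding steps observed
        self.total = 0.0      # sum of attention scores received
        self.total_sq = 0.0   # sum of squared attention scores

    def update(self, score):
        self.count += 1
        self.total += score
        self.total_sq += score * score

    @property
    def mean(self):
        # Mean attention score over the steps this token has been cached.
        return self.total / self.count

    @property
    def std(self):
        # Population standard deviation; clamp tiny negative rounding errors.
        var = self.total_sq / self.count - self.mean ** 2
        return math.sqrt(max(var, 0.0))


def choose_eviction(stats, protect_ratio=0.5):
    """Pick the cache index to evict.

    Assumption: the `protect_ratio` fraction of tokens with the largest
    standard deviation is excluded from the eviction scope; among the
    remaining tokens, the one with the lowest mean attention score goes.
    """
    n = len(stats)
    n_protected = min(int(n * protect_ratio), n - 1)  # keep the scope non-empty
    by_std = sorted(range(n), key=lambda i: stats[i].std, reverse=True)
    scope = by_std[n_protected:]
    return min(scope, key=lambda i: stats[i].mean)


# Four cached tokens, each observed for two decoding steps.
history = [[0.5, 0.5], [0.1, 0.1], [0.0, 0.1], [0.9, 0.1]]
stats = []
for scores in history:
    s = TokenStats()
    for score in scores:
        s.update(score)
    stats.append(s)

victim = choose_eviction(stats, protect_ratio=0.5)
```

In this toy run, token 2 has the lowest mean score but is protected because its scores fluctuate, so token 1 is evicted instead. That is the kind of interaction between importance score and eviction scope the paper analyzes.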

Abstract

Despite their recent success, Large Language Models (LLMs) are notably cost-prohibitive to deploy in resource-constrained environments due to their excessive memory and computational demands. In addition to model parameters, the key-value cache is also stored in GPU memory, growing linearly with batch size and sequence length. As a remedy, recent works have proposed various eviction policies for keeping the key-value cache overhead within a given budget. This paper examines the efficacy of existing eviction policies in terms of importance score calculation and eviction scope construction. We identify the deficiency of prior policies in these two aspects and introduce RoCo, a robust cache omission policy based on temporal attention scores and robustness measures. Extensive experimentation spanning prefilling and auto-regressive decoding stages validates the superiority of RoCo. Finally, we release EasyKV, a versatile software package dedicated to user-friendly key-value constrained generative inference. Code is available at https://github.com/DRSY/EasyKV.
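To make the linear growth of the key-value cache concrete, here is a rough back-of-the-envelope calculation. The function name `kv_cache_bytes` is hypothetical, and the configuration values (32 layers, 32 key-value heads, head dimension 128, 2 bytes per element for fp16) are assumed for a LLaMa2-7B-sized model rather than taken from the abstract.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Estimate KV cache size: one key and one value vector per token,
    per head, per layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Assumed LLaMa2-7B-like configuration in fp16.
full = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
doubled_batch = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=2)
half_budget = kv_cache_bytes(32, 32, 128, seq_len=2048, batch_size=1)

print(full / 1024**3, "GiB")           # 2.0 GiB
print(doubled_batch / 1024**3, "GiB")  # 4.0 GiB
```

Under these assumptions, each cached token costs 0.5 MiB, so capping the cache at half the sequence length halves the memory, which is the budget an eviction policy has to work within.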

Paper Structure (46 sections, 2 equations, 6 figures, 9 tables)

This paper contains 46 sections, 2 equations, 6 figures, 9 tables.

Introduction
Background
Transformer-based LLMs
Attention Block
Key-Value Cache
Efficient LLMs
Problem Formulation
Standard Inference
Key-Value Constrained Inference
Eviction Policy for Key-Value Constrained Inference
Importance Score Calculation
Eviction Scope Construction
Preliminary Experiments
Setup
Results
...and 31 more sections

Figures (6)

Figure 1: Illustration of KV cache eviction inside one attention layer ($L$ layers in total). In this example, a single pair of key-value vectors is deleted (red hatched areas) before the next token's pair is appended. Different heads ($H$ in total) and different layers may evict at different positions.
Figure 2: Consistency of different importance calculation methods w.r.t. their full KV cache variants.
Figure 3: Illustration of persistence of attention robustness. We extract attention scores and compute the standard deviation from LLaMa2-7B-Chat.
Figure 4: Results of MAS paired with local window and standard deviation on text summarization.
Figure 5: Performance of different eviction policies on language modeling task based on LLaMa2-7B.
...and 1 more figure
