No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization

June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee

TL;DR

This work tackles the memory bottleneck of KV caches in autoregressive LLM inference by revealing that eviction-based cache compression can cause safety breaches, incoherence, and hallucinations due to loss of contextual information. It introduces MiKV, a mixed-precision KV cache that retains the KV pairs that would otherwise be evicted in low precision while preserving important KV pairs in high precision, complemented by dynamic outlier-aware quantization and acceleration techniques. Across diverse benchmarks and backbones, MiKV achieves state-of-the-art memory-accuracy trade-offs, enabling substantial compression without sacrificing generation quality and remaining robust on benchmarks such as AlpacaEval. The approach highlights practical memory savings for deployment and emphasizes safety considerations in cache-enabled inference.
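To make the memory bottleneck concrete, the following back-of-the-envelope sketch estimates the size of an uncompressed KV cache. The function name `kv_cache_bytes` is hypothetical, and the configuration (32 layers, 32 attention heads, head dimension 128, FP16 storage) is an assumption chosen to resemble Llama-2-7b rather than a figure taken from the paper.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer, head, and token."""
    # The leading factor 2 accounts for storing both K and V.
    return 2 * num_layers * num_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Assumed configuration resembling Llama-2-7b with FP16 (2-byte) cache entries.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch_size=1)
full = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=32)

print(per_token / 2**20, "MiB per cached token")
print(full / 2**30, "GiB for batch size 32 at 4096 tokens")
```

Under these assumptions the cache costs 0.5 MiB per token and 64 GiB for a batch of 32 sequences of 4096 tokens, several times the roughly 13 GB of FP16 weights of a model with about 7 billion parameters.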

Abstract

Key-Value (KV) Caching has become an essential technique for accelerating the inference speed and throughput of generative Large Language Models (LLMs). However, the memory footprint of the KV cache poses a critical bottleneck in LLM deployment as the cache size grows with batch size and sequence length, often surpassing even the size of the model itself. Although recent methods were proposed to select and evict unimportant KV pairs from the cache to reduce memory consumption, the potential ramifications of eviction on the generative process are yet to be thoroughly examined. In this paper, we examine the detrimental impact of cache eviction and observe that unforeseen risks arise as the information contained in the KV pairs is exhaustively discarded, resulting in safety breaches, hallucinations, and context loss. Surprisingly, we find that preserving even a small amount of information contained in the evicted KV pairs via reduced-precision quantization substantially recovers the incurred degradation. On the other hand, we observe that the important KV pairs must be kept at a relatively higher precision to safeguard the generation quality. Motivated by these observations, we propose Mixed-precision KV cache (MiKV), a reliable cache compression method that simultaneously preserves the context details by retaining the evicted KV pairs in low precision and ensures generation quality by keeping the important KV pairs in high precision. Experiments on diverse benchmarks and LLM backbones show that our proposed method offers a state-of-the-art trade-off between compression ratio and performance, compared to other baselines.
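The core idea in the abstract, keeping important KV pairs at high precision and retaining the rest at low precision instead of evicting them, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the per-token asymmetric uniform quantizer, the `keep_ratio` and `low_bits` parameters, and the importance scores are all assumptions made for illustration. The sketch only simulates low-bit storage by quantizing and dequantizing, and it omits the outlier handling and acceleration techniques mentioned in the TL;DR.

```python
import numpy as np


def quantize_per_token(x, bits):
    """Asymmetric uniform quantization with one (scale, offset) pair per token row."""
    levels = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels  # guard against constant rows
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate floating-point values."""
    return codes.astype(np.float32) * scale + lo


def mixed_precision_cache(kv, importance, keep_ratio=0.25, low_bits=2):
    """Keep the highest-scoring tokens in FP16; simulate low-bit storage for the rest."""
    num_tokens = kv.shape[0]
    num_keep = max(1, int(round(keep_ratio * num_tokens)))
    important = np.zeros(num_tokens, dtype=bool)
    important[np.argsort(importance)[-num_keep:]] = True

    approx = np.empty_like(kv, dtype=np.float32)
    # Important tokens: round-trip through FP16 (high precision).
    approx[important] = kv[important].astype(np.float16).astype(np.float32)
    # Remaining tokens: retained in low precision rather than evicted.
    codes, scale, lo = quantize_per_token(kv[~important], low_bits)
    approx[~important] = dequantize(codes, scale, lo)
    return approx, important


rng = np.random.default_rng(0)
keys = rng.standard_normal((8, 4)).astype(np.float32)
# Hypothetical importance scores: later tokens are treated as more important.
importance = np.arange(8, dtype=np.float32)
approx, important = mixed_precision_cache(keys, importance, keep_ratio=0.25, low_bits=2)
print("important tokens:", np.flatnonzero(important))
print("max error (important):", np.abs(approx[important] - keys[important]).max())
print("max error (retained):", np.abs(approx[~important] - keys[~important]).max())
```

In practice the importance scores would come from an eviction policy such as accumulated attention weights, and the low-bit codes would be bit-packed so that the memory saving is actually realized.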

Paper Structure

This paper contains 32 sections, 3 equations, 15 figures, and 6 tables.

Figures (15)

  • Figure 1: Safety breaches induced by 50% KV cache eviction (H2O; zhang2023h2o) in Llama-2-7b-chat.
  • Figure 2: Observed contextual incoherency and hallucination induced by 50% KV cache eviction (H2O).
  • Figure 3: Line retrieval performance of KV cache eviction (H2O), oracle eviction, and mixed-precision KV cache (MiKV).
  • Figure 4: Self-attention with MiKV during the generation phase. Blue boxes mark the parts that remain unchanged from the conventional method, while red-bordered boxes mark the logic that MiKV adds. Left: the self-attention operation of MiKV at the current t-th generation step. Right: how the $K$ and $V$ tokens at the t-th step are split into important tokens and retained tokens in MiKV. MiKV can also apply the token importance policies proposed in existing approaches such as zhang2023h2o or ge2024model.
  • Figure 5: Outliers observed in both keys and queries across multiple layers of Llama-2-7b-chat. More outlier plots for other layers and backbones are provided in the appendix.
  • ...and 10 more figures