TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization

Dingyu Yao, Bowen Shen, Zheng Lin, Wei Liu, Jian Luan, Bin Wang, Weiping Wang

TL;DR

This work addresses the memory and latency bottlenecks of KV caches in long-context LLM inference by proposing TailorKV, a hybrid framework that classifies Transformer layers into quantization-friendly and sparsity-friendly groups. It combines aggressive 1-bit quantization for quantization-friendly layers with dynamic Top-K token retrieval for sparsity-friendly layers, enabled by offline layer identification and asynchronous CPU-GPU co-execution. The approach yields substantial memory reductions with near-lossless accuracy across LongBench, InfiniteBench, and RULER benchmarks, achieving practical latency performance (e.g., 82 ms per token for Llama-3.1-8B with 128k context on a single RTX 3090) and enabling long-context serving on resource-constrained GPUs. The results demonstrate the viability of layer-aware compression to extend the reach of LLMs to devices with limited memory, while offering a hardware-friendly design and strong empirical gains over state-of-the-art methods.
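To make the idea of 1-bit quantization concrete, here is a minimal sketch of a generic sign-and-scale quantizer in NumPy. It is not taken from the paper: the per-channel offset and scale, the function names `quantize_1bit` and `dequantize_1bit`, and the tensor shapes are all assumptions chosen for illustration.

```python
import numpy as np

def quantize_1bit(x):
    """Quantize x of shape (tokens, channels) to 1 bit per element.

    Assumed scheme (illustrative, not the paper's): each channel keeps an
    offset (its mean) and a scale (its mean absolute deviation); each
    element stores only whether it lies at or above the channel mean.
    """
    offset = x.mean(axis=0, keepdims=True)
    centered = x - offset
    scale = np.abs(centered).mean(axis=0, keepdims=True)
    bits = centered >= 0                    # boolean mask, one bit of information each
    packed = np.packbits(bits, axis=0)      # pack 8 tokens into one byte per channel
    return packed, scale, offset, x.shape[0]

def dequantize_1bit(packed, scale, offset, n_tokens):
    """Rebuild an approximation: offset plus or minus scale per element."""
    bits = np.unpackbits(packed, axis=0, count=n_tokens).astype(bool)
    signs = np.where(bits, 1.0, -1.0)
    return signs * scale + offset

rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 128)).astype(np.float32)  # hypothetical key cache
packed, scale, offset, n = quantize_1bit(keys)
restored = dequantize_1bit(packed, scale, offset, n)

fp16_bytes = keys.size * 2                  # size if stored in fp16
mse = float(np.mean((restored - keys) ** 2))
print(packed.nbytes, fp16_bytes, mse)
```

In this sketch the packed payload uses 1 bit per element instead of 16 for fp16, not counting the small per-channel scale and offset. The reconstruction error is far from zero, which is consistent with the claim that such aggressive quantization is only applied to layers identified as quantization-friendly.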

Abstract

The Key-Value (KV) cache in generative large language models (LLMs) introduces substantial memory overhead. Existing works mitigate this burden by offloading or compressing the KV cache. However, loading the entire cache incurs significant latency due to PCIe bandwidth bottlenecks in CPU-GPU communication, while aggressive compression causes notable performance degradation. We identify that certain layers in the LLM need to maintain global information and are unsuitable for selective loading. In contrast, other layers primarily focus on a few tokens with dominant activations, which can incur substantial quantization error. This observation leads to a key insight: loading dominant tokens and quantizing all tokens can complement each other. Building on this insight, we propose a hybrid compression method, TailorKV, which seamlessly integrates quantization and offloading. TailorKV provides an inference framework along with a hardware-friendly implementation that leverages these complementary characteristics. Extensive long-context evaluations show that TailorKV achieves nearly lossless performance under aggressive compression settings, outperforming the state of the art. In particular, Llama-3.1-8B with a 128k context can be served on a single RTX 3090 GPU, reaching 82 ms per token during decoding.
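A back-of-the-envelope calculation illustrates why a 128k-token context strains a single GPU. The sketch below assumes the publicly documented Llama-3.1-8B configuration (32 layers, 8 key-value heads, head dimension 128) and fp16 storage. None of these values are stated in the abstract, and the helper `kv_cache_bytes` is a hypothetical name.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Size of an uncompressed KV cache in bytes.

    Defaults are assumptions based on the public Llama-3.1-8B config
    with fp16 (2-byte) elements.
    """
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

context = 128 * 1024                       # 128k tokens
total = kv_cache_bytes(context)
print(f"{total / 2**30:.1f} GiB")          # prints "16.0 GiB"
```

Under these assumptions the uncompressed cache alone takes 16 GiB. Together with roughly 16 GB of fp16 weights for about 8 billion parameters, this exceeds the 24 GB of an RTX 3090, which is why the cache must be compressed, offloaded, or both.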

Paper Structure

This paper contains 49 sections, 13 equations, 14 figures, 15 tables.

Figures (14)

  • Figure 1: Observations on attention. (a) Attention weights on Llama-2-7B-32K-Instruct. Detailed visualizations are in the appendix. (b) Sparse error of different models on the 2WikiMQA dataset, with only the top 5% of attention scores retained. (c) Sparse error on different datasets, with only the top 5% of attention scores retained.
  • Figure 2: (Top) Query and key in Llama-3.1-8B-Instruct show outlier patterns in some channels, while the value shows no outliers. (Bottom) The number of times reaching the Top-8. Outliers may appear in any position.
  • Figure 3: System overview of TailorKV. Offline identification categorizes the layers into quantization-friendly and sparsity-friendly. For quantization-friendly layers, we employ aggressive static quantization. For sparsity-friendly layers, we dynamically retrieve Top-K tokens. Critical current query and critical key cache represent the outliers in the query and key cache, respectively.
  • Figure 4: Two-stage dynamic retrieval process: Stage 1 estimates critical channels at layer $l-1$ and prefetches critical key cache for layer $l$. Stage 2 approximates attention scores and selects Top-K tokens at layer $l$.
  • Figure 5: Timeline of dynamic retrieval. Blue signifies computation and pink signifies communication.
  • ...and 9 more figures
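
The two-stage retrieval described in the Figure 4 caption can be sketched for a single attention head as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: critical channels are chosen here as the largest-magnitude entries of the current query, the channel count of 8 and K of 64 are arbitrary, and the cross-layer estimation at layer $l-1$ and the asynchronous CPU-GPU prefetch are omitted. The function names are hypothetical.

```python
import numpy as np

def select_topk_tokens(q, keys, n_channels=8, k=64):
    """Pick Top-K token indices using scores approximated from a few channels.

    q: (head_dim,) current query; keys: (n_tokens, head_dim) key cache.
    Assumption: critical channels are those with the largest |q| entries.
    """
    # Stage 1: estimate critical channels; only these key columns would be fetched.
    channels = np.argsort(-np.abs(q))[:n_channels]
    critical_keys = keys[:, channels]
    # Stage 2: approximate attention scores on the reduced channels, take Top-K.
    approx_scores = critical_keys @ q[channels]
    k = min(k, keys.shape[0])
    top = np.argpartition(-approx_scores, k - 1)[:k]
    return np.sort(top)

def sparse_attention(q, keys, values, token_ids):
    """Exact softmax attention restricted to the selected tokens."""
    scores = keys[token_ids] @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())   # subtract max for numerical stability
    weights /= weights.sum()
    return weights @ values[token_ids]

rng = np.random.default_rng(0)
n_tokens, head_dim = 4096, 128                # hypothetical sizes
keys = rng.standard_normal((n_tokens, head_dim))
values = rng.standard_normal((n_tokens, head_dim))
q = rng.standard_normal(head_dim)

token_ids = select_topk_tokens(q, keys)
out = sparse_attention(q, keys, values, token_ids)
print(token_ids.shape, out.shape)             # (64,) (128,)
```

The point of the split is that the score approximation touches only `n_channels` columns of the key cache, so far less data has to be moved before the Top-K tokens are known. Full keys and values are then needed only for the selected tokens.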