ThinK: Thinner Key Cache by Query-Driven Pruning

Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo

TL;DR

ThinK addresses the KV-cache memory bottleneck in long-context LLM inference by pruning key-cache channels along the head dimension using a query-dependent criterion. It minimizes attention-weight loss through a per-head interaction score and greedy channel selection, saving memory while preserving or improving accuracy. The approach is plug-and-play and composes with existing KV-cache eviction and quantization methods: combined with KIVI, it delivers up to a 2.8x peak-memory reduction and enables up to 5x larger batch sizes on a single GPU. On LongBench and Needle-in-a-Haystack, ThinK maintains or improves accuracy, and it provides a practical baseline for efficient deployment of models such as LLaMA and Mistral in long-context settings.

Abstract

Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. However, their increased computational and memory demands present significant challenges, especially when handling long sequences. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. Unlike existing approaches that optimize the memory based on the sequence length, we identify substantial redundancy in the channel dimension of the KV cache, as indicated by an uneven magnitude distribution and a low-rank structure in the attention weights. In response, we propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels. Our approach not only maintains or enhances model accuracy but also achieves a reduction in KV cache memory costs by over 20% compared with vanilla KV cache eviction and quantization methods. For instance, ThinK integrated with KIVI can achieve a 2.8x reduction in peak memory usage while maintaining nearly the same quality, enabling up to a 5x increase in batch size when using a single GPU. Extensive evaluations on the LLaMA and Mistral models across various long-sequence datasets verified the efficiency of ThinK, establishing a new baseline algorithm for efficient LLM deployment without compromising performance. Our code has been made available at https://github.com/SalesforceAIResearch/ThinK.
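
To give a feel for where a saving of around 20% can come from, the following back-of-the-envelope sketch estimates KV cache size when only the key channels are pruned. The function name `kv_cache_bytes`, the fp16 element size, and the 4096-token sequence are illustrative assumptions, not taken from the paper. The estimate also ignores the stored channel mask and any tokens kept unpruned, so it should be read as a rough approximation rather than the paper's measurement.

```python
def kv_cache_bytes(layers, heads, seq_len, head_dim,
                   bytes_per_elem=2, key_prune_ratio=0.0):
    """Rough KV cache size for one sequence (hypothetical helper).

    Only the key cache is thinned; the value cache keeps all channels.
    Mask storage and unpruned recent tokens are deliberately ignored.
    """
    kept_key_dim = round(head_dim * (1.0 - key_prune_ratio))
    key_bytes = layers * heads * seq_len * kept_key_dim * bytes_per_elem
    value_bytes = layers * heads * seq_len * head_dim * bytes_per_elem
    return key_bytes + value_bytes


# Illustrative shape: 32 layers, 32 heads, head dimension 128 (LLaMA-2-7B-like),
# with an assumed 4096-token context stored in fp16.
full = kv_cache_bytes(32, 32, 4096, 128)
pruned = kv_cache_bytes(32, 32, 4096, 128, key_prune_ratio=0.4)
saving = 1.0 - pruned / full
print(f"full={full / 2**30:.2f} GiB, pruned={pruned / 2**30:.2f} GiB, saving={saving:.1%}")
```

Because keys make up half of the cache, pruning a fraction $\lambda$ of key channels removes roughly $\lambda/2$ of the total under these assumptions, so $\lambda=0.4$ corresponds to a saving of about 20%.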

Paper Structure

This paper contains 21 sections, 5 equations, 7 figures, and 12 tables.

Figures (7)

  • Figure 1: An illustration of the pruning procedure of ThinK. Within each attention head, scores are computed for each channel, and only the top $T$ channels out of $D$ are selected for retention. A binary channel mask, along with the pruned keys, is then stored in the cache memory.
  • Figure 2: Implementation during decoding.
  • Figure 3: (a) presents the performance comparison with token eviction methods under identical memory usage for Mistral-7B-Instruct-v0.2, while (b) illustrates the memory usage comparison with the KV cache quantization method KIVI across different batch sizes for LLaMA-2-7B-chat. ThinK ($0.4$) indicates we prune the key cache channels with a pruning ratio of $\lambda=0.4$.
  • Figure 4: Magnitude of key and value cache for LLaMA-2-7B. The first heads of layer $14$ and layer $20$ of LLaMA-2-7B are selected to visualize the magnitude of the key and value caches. We observe that the magnitudes vary considerably across key cache channels, whereas the channels of the value cache do not exhibit such variation.
  • Figure 5: The energy and cumulative energy of the singular values.
  • ...and 2 more figures
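
The procedure in the Figure 1 caption (score each channel per head, keep the top $T$ of $D$, store a binary mask with the pruned keys) can be sketched in NumPy as below. This is a minimal sketch under stated assumptions, not the authors' implementation. In particular, the score used here, the Frobenius norm of the outer product of the $j$-th query column and the $j$-th key column, is one plausible query-dependent interaction score; the text above only says that a per-head, per-channel score is computed. The names `think_prune_keys` and `pruned_logits` are hypothetical.

```python
import numpy as np


def think_prune_keys(queries, keys, keep):
    """Keep the `keep` highest-scoring key channels in each head.

    queries: (heads, q_len, D) observed queries used for scoring
    keys:    (heads, k_len, D) cached keys
    Returns pruned keys (heads, k_len, keep) and a boolean mask (heads, D).
    """
    # Assumed score: ||Q[:, j] K[:, j]^T||_F, which for a rank-one outer
    # product equals ||Q[:, j]||_2 * ||K[:, j]||_2.
    scores = np.linalg.norm(queries, axis=1) * np.linalg.norm(keys, axis=1)
    # Top-`keep` channel indices per head, sorted to preserve channel order.
    top = np.sort(np.argsort(-scores, axis=-1)[:, :keep], axis=-1)
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=-1)
    pruned = np.take_along_axis(keys, top[:, None, :], axis=-1)
    return pruned, mask


def pruned_logits(query, pruned_keys, mask):
    """Attention logits for one decoding step using only retained channels.

    query: (heads, D); pruned_keys: (heads, k_len, T); mask: (heads, D)
    """
    heads, _, kept = pruned_keys.shape
    q_kept = query[mask].reshape(heads, kept)  # drop the same channels from q
    return np.einsum("ht,hkt->hk", q_kept, pruned_keys)


rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4, 8))   # 2 heads, 4 observed queries, D = 8
K = rng.standard_normal((2, 6, 8))   # 6 cached tokens
K_pruned, channel_mask = think_prune_keys(Q, K, keep=5)
logits = pruned_logits(Q[:, -1, :], K_pruned, channel_mask)
print(K_pruned.shape, channel_mask.sum(axis=1), logits.shape)
```

Storing the mask alongside the pruned keys lets the decoder apply the same channel selection to each new query, so the dot product is computed over $T$ channels instead of $D$.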