BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference

Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He

TL;DR

BUZZ is a novel KV caching algorithm that leverages structured contextual information to minimize cache memory usage while enhancing inference speed, achieving a significant inference speedup with $\log{n}$ time complexity.

Abstract

Large language models (LLMs) are essential in natural language processing but often struggle with inference speed and computational efficiency, limiting real-time deployment. The key-value (KV) cache mechanism reduces computational overhead in transformer models, but challenges in maintaining contextual understanding remain. In this paper, we propose BUZZ, a novel KV caching algorithm that leverages structured contextual information to minimize cache memory usage while enhancing inference speed. BUZZ employs a beehive-structured sparse cache, incorporating a sliding window to capture recent information and dynamically segmenting historical tokens into chunks to prioritize important tokens in local neighborhoods. We evaluate BUZZ on four real-world datasets: CNN/Daily Mail, XSUM, Wikitext, and 10-QA. Our results demonstrate that BUZZ (1) reduces cache memory usage by $\textbf{2.5}\times$ in LLM inference while maintaining over 99% accuracy in long-text summarization, and (2) surpasses state-of-the-art performance in multi-document question answering by $\textbf{7.69\%}$ under the same memory limit, where full cache methods encounter out-of-memory issues. Additionally, BUZZ achieves significant inference speedup with a $\log{n}$ time complexity. The code is available at https://github.com/JunqiZhao888/buzz-llm.
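
The cache layout described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the BUZZ repository: the function name `select_retained`, its parameters, and the use of a single precomputed score per token are assumptions made for illustration. The sketch keeps the most recent `window` tokens unconditionally, splits the older tokens into chunks of `stride` tokens, and keeps only the highest-scoring token of each chunk.

```python
from typing import List, Sequence


def select_retained(scores: Sequence[float], window: int, stride: int) -> List[int]:
    """Return the sorted indices of the tokens to keep in the cache.

    scores : one importance score per cached token (assumed to be given)
    window : number of most recent tokens that are always kept
    stride : chunk length used to segment the older (historical) tokens
    """
    if window < 0 or stride < 1:
        raise ValueError("window must be >= 0 and stride must be >= 1")
    n = len(scores)
    split = max(n - window, 0)  # tokens in [split, n) form the sliding window
    keep: List[int] = []
    # Segment the historical tokens into chunks; keep each chunk's top scorer.
    for start in range(0, split, stride):
        end = min(start + stride, split)
        keep.append(max(range(start, end), key=lambda i: scores[i]))
    keep.extend(range(split, n))  # recent tokens are retained as-is
    return keep


# Hypothetical scores for ten cached tokens.
scores = [0.1, 0.9, 0.2, 0.3, 0.4, 0.8, 0.5, 0.6, 0.7, 0.05]
print(select_retained(scores, window=3, stride=3))  # [1, 5, 6, 7, 8, 9]
```

In this toy run, ten tokens shrink to six: one survivor from each of the three historical chunks, plus the three-token window.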

Paper Structure

This paper contains 25 sections, 2 theorems, 21 equations, 9 figures, 6 tables, 2 algorithms.

Key Result

Theorem 3.1

Maintaining a constant stride $s$ and cache size $C$, the performance of the LLM is expected to be optimal when the following condition is satisfied ($T$ denotes the threshold for eviction, and $w$ denotes the sliding window size):

Figures (9)

  • Figure 1: Illustration of BUZZ vs. existing methods: We visualize sequential decoding steps, where grey blocks represent masked or unseen tokens, and colored blocks represent retained tokens in each respective cache method. Method (a) illustrates the full cache, the default scheme that consumes substantial memory as context length increases. Method (b) dynamically retains heavy-hitters (tokens with high attention scores) outside the window. Method (c) modifies the local method by adding a narrow attention sink, which significantly enhances performance in benchmark experiments. Our method, BUZZ (d), dynamically retains heavy-hitters while preserving the contextual structure.
  • Figure 2: Overview of the local heavy hitter mechanism (BeeHive) in BUZZ: BUZZ approximates the attention scores of middle tokens, extracts local maxima, and evicts the rest. The stride size, a user-defined parameter, controls the granularity of these local neighborhoods.
  • Figure 3: Algorithm Illustration: New tokens are placed in the buffer. Once the total token count reaches the threshold, we sample the new tokens with a large stride and the old tokens with a small stride. After this process, all current tokens become the old tokens, and the buffer is cleared to accommodate new tokens.
  • Figure 4: Model performance under different $\frac{T}{w}$ values. We use CNN/Daily Mail as the dataset, with the stride set to 5 and the cache size set to around 200.
  • Figure 5: Comparison of methods on summarization ROUGE scores vs. KV cache budget. The black dotted line in each graph marks the accuracy achieved by the full cache method.
  • ...and 4 more figures
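
The buffer-and-threshold cycle described in the Figure 3 caption can be sketched as follows. This is an illustrative reading of the caption only, not the authors' implementation: the class `BufferedCache`, the helper `local_maxima`, the `(position, score)` token representation, and the concrete threshold and stride values are all hypothetical, and the sliding window is left out for brevity.

```python
from typing import List, Tuple

Token = Tuple[int, float]  # (position, importance score); scores are assumed inputs


def local_maxima(tokens: List[Token], stride: int) -> List[Token]:
    """Keep the highest-scoring token of each consecutive chunk of `stride` tokens."""
    return [max(tokens[i:i + stride], key=lambda t: t[1])
            for i in range(0, len(tokens), stride)]


class BufferedCache:
    """Toy cache: new tokens go to a buffer; eviction runs at a threshold."""

    def __init__(self, threshold: int, old_stride: int, new_stride: int) -> None:
        if min(threshold, old_stride, new_stride) < 1:
            raise ValueError("threshold and strides must be >= 1")
        self.threshold = threshold
        self.old_stride = old_stride  # small stride applied to old tokens
        self.new_stride = new_stride  # large stride applied to buffered tokens
        self.old: List[Token] = []
        self.buffer: List[Token] = []

    def append(self, token: Token) -> None:
        self.buffer.append(token)
        if len(self.old) + len(self.buffer) >= self.threshold:
            self._evict()

    def _evict(self) -> None:
        kept_old = local_maxima(self.old, self.old_stride)
        kept_new = local_maxima(self.buffer, self.new_stride)
        self.old = kept_old + kept_new  # all surviving tokens become "old"
        self.buffer = []                # buffer is cleared for new tokens

    def positions(self) -> List[int]:
        return [pos for pos, _ in self.old + self.buffer]


cache = BufferedCache(threshold=6, old_stride=2, new_stride=3)
for pos, score in enumerate([0.2, 0.7, 0.1, 0.4, 0.9, 0.3, 0.6, 0.5]):
    cache.append((pos, score))
print(cache.positions())  # [1, 4, 6, 7]
```

In this run the sixth token triggers an eviction that keeps positions 1 and 4 from the buffer; positions 6 and 7 then wait in the cleared buffer until the threshold is reached again.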

Theorems & Definitions (6)

  • Remark 1
  • Remark 2
  • Theorem 3.1: Parameter Estimation Theorem
  • Lemma 3.1
  • proof : Proof
  • proof : Proof