
Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval

Wenhao Li, Yuxin Zhang, Gen Luo, Haiyuan Wan, Ziyang Gong, Fei Chao, Rongrong Ji

TL;DR

Spotlight Attention tackles the KV cache bottleneck in LLM decoding by replacing the random linear hashing used in prior work with a learnable, non-linear MLP-based hash, trained with a Bradley-Terry ranking loss. The approach uses $128$-bit hash codes, roughly $5$-fold shorter than prior methods, while preserving retrieval quality, and it gains substantial throughput from dedicated CUDA kernels for bit-packing and NXOR-GEMM. The method shows competitive perplexity and strong fidelity on downstream tasks, including Needle-in-a-Haystack and LongBench, and achieves end-to-end throughput improvements of up to $3$-fold on several models. Despite these results, the reported IoU remains around $40\%$, indicating room for improvement in hash-based token retrieval. Overall, Spotlight Attention offers a practical pathway to efficient LLM generation by significantly reducing KV-code overhead without sacrificing performance.
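To make the ranking objective concrete, here is a minimal sketch of a generic pairwise Bradley-Terry loss. It is not taken from the paper: the function name `bradley_terry_loss`, the choice of pairing every top-k token against every other token, and the plain-Python formulation are all assumptions for illustration. The only idea it shows is that a token in the true top-k should receive a higher hash-based score than a token outside it.

```python
import math


def bradley_terry_loss(scores, is_topk):
    """Mean pairwise Bradley-Terry loss, -log(sigmoid(s_pos - s_neg)).

    scores  : hash-based similarity score per cached token (hypothetical input)
    is_topk : True where the token is in the true attention top-k
    """
    pos = [s for s, t in zip(scores, is_topk) if t]
    neg = [s for s, t in zip(scores, is_topk) if not t]
    if not pos or not neg:
        raise ValueError("need at least one top-k and one non-top-k token")

    total = 0.0
    for p in pos:
        for n in neg:
            d = p - n
            # Numerically stable form of -log(sigmoid(d)) = log(1 + exp(-d)).
            total += max(-d, 0.0) + math.log1p(math.exp(-abs(d)))
    return total / (len(pos) * len(neg))
```

Under this formulation the loss equals $\log 2$ when all scores are tied and approaches $0$ as every top-k token outscores every other token by a wide margin.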

Abstract

Reducing the key-value (KV) cache burden in Large Language Models (LLMs) significantly accelerates inference. Dynamically selecting critical KV caches during decoding helps maintain performance. Existing methods use random linear hashing to identify important tokens, but this approach is inefficient due to the orthogonal distribution of queries and keys within two narrow cones in LLMs. We introduce Spotlight Attention, a novel method that employs non-linear hashing functions to optimize the embedding distribution of queries and keys, enhancing coding efficiency and robustness. We also developed a lightweight, stable training framework using a Bradley-Terry ranking-based loss, enabling optimization of the non-linear hashing module on GPUs with 16GB memory in 8 hours. Experimental results show that Spotlight Attention drastically improves retrieval precision while shortening the hash code by at least $5\times$ compared to traditional linear hashing. Finally, we exploit the computational advantages of bitwise operations by implementing specialized CUDA kernels, achieving hashing retrieval for 512K tokens in under $100\,\mu$s on a single A100 GPU, with end-to-end throughput up to $3\times$ higher than vanilla decoding.
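The retrieval step described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: the two-layer ReLU network, its random untrained weights, and the names `make_mlp`, `hash_code`, `matching_bits`, and `retrieve_top_k` are all hypothetical. Python integers stand in for packed bit codes, and an XNOR followed by a population count stands in for the specialized CUDA kernels.

```python
import random

CODE_BITS = 128  # hash code length reported in the TL;DR


def make_mlp(dim, hidden, bits, rng):
    """Random, untrained two-layer MLP weights (placeholder for a learned hash)."""
    w1 = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(hidden)]
    w2 = [[rng.gauss(0.0, 1.0) for _ in range(hidden)] for _ in range(bits)]
    return w1, w2


def hash_code(x, mlp):
    """Map a vector to a packed bit code: ReLU layer, linear layer, then sign."""
    w1, w2 = mlp
    h = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    code = 0
    for row in w2:
        bit = sum(w * v for w, v in zip(row, h)) > 0.0
        code = (code << 1) | int(bit)  # pack one bit per output unit
    return code


def matching_bits(a, b, bits=CODE_BITS):
    """Count positions where two codes agree: popcount of XNOR(a, b)."""
    mask = (1 << bits) - 1
    return bin(~(a ^ b) & mask).count("1")


def retrieve_top_k(query_code, key_codes, k):
    """Return indices of the k cached keys whose codes best match the query."""
    order = sorted(
        range(len(key_codes)),
        key=lambda i: matching_bits(query_code, key_codes[i]),
        reverse=True,
    )
    return order[:k]
```

In use, each cached key would be hashed once when it is appended to the cache, and each decoding step would hash the current query and call `retrieve_top_k` to pick which KV entries take part in attention. Whether queries and keys share one hash network or use separate ones is not stated in this section, so the sketch leaves that choice to the caller.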

Paper Structure

This paper contains 27 sections, 10 equations, 14 figures, 11 tables.

Figures (14)

  • Figure 1: Overview. (Left) Architecture. Comparison of Spotlight Attention with standard attention. Spotlight Attention adds a hash code-based retrieval mechanism to each layer. (Middle) Performance. Spotlight Attention achieves the most accurate retrieval and generates the response closest to that of the original model on QA datasets. (Right) Visualization. For arbitrarily complex attention patterns, our method estimates the top-k sequences well, with an average correctness rate of more than half for different models.