Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression

Haoyu Wang, Tong Teng, Tianyu Guo, An Xiao, Duyu Tang, Hanting Chen, Yunhe Wang

TL;DR

This work tackles long-context processing in large language models by addressing both out-of-distribution (OOD) extrapolation and quadratic attention costs. It introduces Efficient Selective Attention (ESA), a parameter-free method that selects a fixed number of crucial tokens at the token level using query-key compression, together with a proximity-influence mechanism that preserves semantic continuity. This enables context lengths well beyond the pretraining length without retraining. ESA achieves accuracy competitive with full-attention extrapolation on diverse long-context benchmarks (LongBench, ∞BENCH, NeedleBench, Counting-Stars) across Mistral and Llama models, while significantly reducing computation (down to about $1.56\%$ of the original per-step cost in some settings). The approach relies on calibration-based offline learning of low-dimensional projections and a modest KV-cache augmentation, offering a practical path to efficient long-context inference with minimal engineering changes.

Abstract

Handling long-context sequences efficiently remains a significant challenge in large language models (LLMs). Existing methods for token selection in sequence extrapolation either employ a permanent eviction strategy or select tokens by chunk, which may lead to the loss of critical information. We propose Efficient Selective Attention (ESA), a novel approach that extends context length by efficiently selecting the most critical tokens at the token level to compute attention. ESA reduces the computational complexity of token selection by compressing query and key vectors into lower-dimensional representations. We evaluate ESA on long sequence benchmarks with maximum lengths up to 256k using open-source LLMs with context lengths of 8k and 32k. ESA outperforms other selective attention methods, especially in tasks requiring the retrieval of multiple pieces of information, achieving comparable performance to full-attention extrapolation methods across various tasks, with superior results in certain tasks.

Paper Structure

This paper contains 36 sections, 12 equations, 2 figures, 9 tables.

Figures (2)

  • Figure 1: (a) In long-context scenarios, middle tokens make up the majority of the sequence, while the other three parts of the token sequence have fixed lengths. Importance scores between the current tokens and the middle tokens are used to select the top-k middle tokens, and the selected tokens replace the full set of middle tokens when computing attention. (b) Queries from the current tokens and keys from the middle tokens are each compressed into smaller tensors by a linear layer. The dot product of the compressed queries and keys serves as the importance score. (c) The selection priority of a middle token is the maximum importance score among itself and several surrounding tokens.
  • Figure 2: Recall rates of each layer for selecting the top 2,000 tokens after dimensionality reduction.
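
The selection pipeline described in the Figure 1 caption, and the recall measure named in Figure 2, can be illustrated with a small pure-Python sketch. It is not the authors' implementation. The function names (`project`, `importance_scores`, `proximity_priority`, `select_top_k`, `recall`) are hypothetical, the projection matrices are toy values rather than calibrated ones, and reducing over multiple current-token queries with a maximum is an assumption, since the caption does not say how scores are aggregated. Per-head and per-layer handling is omitted.

```python
import heapq


def project(vectors, weight):
    """Apply a linear layer: each row vector (length d) times a d x d_low matrix."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(vec, col)) for col in columns]
            for vec in vectors]


def importance_scores(queries, keys, w_q, w_k):
    """Score each middle token by the dot product of compressed queries and keys."""
    q_low = project(queries, w_q)
    k_low = project(keys, w_k)
    # Assumption: reduce over the current-token queries with max.
    return [max(sum(a * b for a, b in zip(q, k)) for q in q_low) for k in k_low]


def proximity_priority(scores, radius):
    """Priority of token i = max score among itself and `radius` neighbors per side."""
    return [max(scores[max(0, i - radius): i + radius + 1])
            for i in range(len(scores))]


def select_top_k(priority, k):
    """Return the indices of the k highest-priority tokens, in sequence order."""
    top = heapq.nlargest(k, range(len(priority)), key=priority.__getitem__)
    return sorted(top)


def recall(selected, reference):
    """Fraction of the reference indices that also appear in the selection."""
    return len(set(selected) & set(reference)) / len(reference)


# Toy data: d = 4, d_low = 2; the projections simply keep the first two dimensions.
w_q = w_k = [[1, 0], [0, 1], [0, 0], [0, 0]]
queries = [[1.0, 0.0, 5.0, 5.0]]
keys = [[0.1, 0, 7, 7], [0.9, 0, 7, 7], [0.2, 0, 7, 7], [0.0, 0, 7, 7], [0.3, 0, 7, 7]]

scores = importance_scores(queries, keys, w_q, w_k)
with_proximity = select_top_k(proximity_priority(scores, radius=1), k=3)
without_proximity = select_top_k(scores, k=3)
print(with_proximity)     # [0, 1, 2]
print(without_proximity)  # [1, 2, 4]
```

In this toy run, proximity influence pulls in the neighbors of the highest-scoring token (index 1), so a contiguous span is selected instead of three scattered tokens. The `recall` helper compares any two index sets, for example indices selected from compressed scores against indices selected from full-dimensional scores.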