Unveiling Simplicities of Attention: Adaptive Long-Context Head Identification
Konstantin Donhauser, Charles Arnal, Mohammad Pezeshki, Vivien Cabannes, David Lopez-Paz, Kartik Ahuja
TL;DR
The paper analyzes long-context attention in decoder-only transformers and identifies two head regimes: local heads that rely on nearby tokens, and long-context heads whose behavior depends on the query. It introduces QAdA, a query-adaptive criterion that uses second-moment statistics of the keys, $\mu_K$ and $\Sigma_K$, to predict which heads require long-context processing without computing full attention, enabling efficient sparsification. Across Llama, Qwen, and Mistral on benchmarks such as RULER, LongBench, and long-context reasoning tasks, QAdA matches or exceeds static pruning performance and approaches oracle-like gains, while reducing run-time complexity to $O((1-\rho) T d + \rho T_{local} d)$. These results reveal simple, robust patterns in attention behavior over long sequences and point to practical efficiency improvements for long-context NLP tasks. The work advances the mechanistic understanding of attention while offering a scalable path toward per-query head allocation and faster inference in large language models.
Abstract
The ability to process long contexts is crucial for many natural language processing tasks, yet it remains a significant challenge. While substantial progress has been made in enhancing the efficiency of attention mechanisms, there is still a gap in understanding how attention heads function in long-context settings. In this paper, we observe that while certain heads consistently attend to local information only, others swing between attending to local and long-context information depending on the query. This raises the question: can we identify which heads require long-context information to predict the next token accurately? We demonstrate that it is possible to predict which heads are crucial for long-context processing using only local keys. The core idea is to exploit a simple model for the long-context scores via second-moment approximations. These findings unveil simple properties of attention in the context of long sequences, and open the door to potentially significant gains in efficiency.
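
To make the idea of a second-moment model for the long-context scores concrete, here is a minimal sketch, not the authors' implementation. It assumes that the scores of keys outside the local window can be treated as Gaussian with mean $q^\top \mu_K$ and variance $q^\top \Sigma_K q$ (after the usual $1/\sqrt{d}$ scaling), so that their total softmax mass can be estimated in closed form and compared with the exactly computed local mass. The function names, the threshold `tau`, and the Gaussian assumption itself are illustrative choices that do not come from the text above.

```python
import math


def dot(a, b):
    """Inner product of two equal-length sequences of floats."""
    return sum(x * y for x, y in zip(a, b))


def estimated_local_mass(q, local_keys, mu_k, sigma_k, n_long):
    """Estimate the share of softmax mass that falls on the local window.

    q          : query vector (length d)
    local_keys : key vectors inside the local window, scored exactly
    mu_k       : mean of the long-context keys (length d)
    sigma_k    : covariance of the long-context keys (d x d nested lists)
    n_long     : number of long-context keys that are NOT scored exactly

    Assumption (illustrative): long-context scores s = q.k / sqrt(d) are
    Gaussian, so E[exp(s)] = exp(mean + variance / 2).
    """
    if n_long <= 0:
        return 1.0  # nothing outside the local window

    scale = 1.0 / math.sqrt(len(q))
    local_scores = [dot(q, k) * scale for k in local_keys]

    # First and second moments of the modelled long-context scores.
    mean_s = dot(q, mu_k) * scale
    var_s = dot(q, [dot(row, q) for row in sigma_k]) * scale * scale

    # log( n_long * E[exp(s)] ) under the Gaussian model.
    log_long = math.log(n_long) + mean_s + 0.5 * var_s

    # Subtract the maximum before exponentiating for numerical stability.
    m = max(local_scores + [log_long])
    local_sum = sum(math.exp(s - m) for s in local_scores)
    long_sum = math.exp(log_long - m)
    return local_sum / (local_sum + long_sum)


def is_local_head(q, local_keys, mu_k, sigma_k, n_long, tau=0.6):
    """Label a head as local for this query if the estimated local mass
    reaches the (hypothetical) threshold tau."""
    return estimated_local_mass(q, local_keys, mu_k, sigma_k, n_long) >= tau
```

In this sketch only the local keys are scored exactly. The remaining keys enter through `mu_k` and `sigma_k` alone, which is what would allow the decision to be made per query without computing full attention. A head labeled local for a given query could then attend to its local window only.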
