Unveiling Simplicities of Attention: Adaptive Long-Context Head Identification

Konstantin Donhauser; Charles Arnal; Mohammad Pezeshki; Vivien Cabannes; David Lopez-Paz; Kartik Ahuja

TL;DR

The paper analyzes long-context attention in decoder-only transformers and identifies two head regimes: local heads, which rely on nearby tokens, and long-context heads, whose behavior depends on the query. It introduces QAdA, a query-adaptive criterion that uses a second-moment approximation of the keys, summarized by $\mu_K$ and $\Sigma_K$, to predict which heads require long-context processing without computing full attention, which enables efficient sparsification. Across Llama, Qwen, and Mistral models on RULER, LongBench, and long-context reasoning tasks, QAdA matches or exceeds static pruning and approaches the performance of an oracle criterion, while reducing run-time complexity to $O((1-\rho) T d + \rho T_{\text{local}} d)$. These results expose simple, robust patterns in attention over long sequences. They advance the mechanistic understanding of attention and offer a scalable path toward per-query head allocation and faster inference in large language models.
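To make the complexity expression concrete, the following minimal sketch evaluates it for one set of example values. It assumes a reading the summary does not spell out: $\rho$ is the fraction of heads treated as local for a given query, $T$ is the full context length, $T_{\text{local}}$ is the local window length, and $d$ is the head dimension. The function name `attention_cost` and all numbers are hypothetical and chosen only for illustration.

```python
def attention_cost(T: int, T_local: int, d: int, rho: float) -> float:
    """Evaluate (1 - rho) * T * d + rho * T_local * d, ignoring constants.

    Assumed reading (not confirmed by the summary): rho is the fraction of
    heads that attend only to the local window of length T_local, and the
    remaining heads attend to all T tokens.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return (1.0 - rho) * T * d + rho * T_local * d


# Hypothetical values: 100k-token context, 1k-token local window, head dim 128.
full = attention_cost(T=100_000, T_local=1_000, d=128, rho=0.0)
sparse = attention_cost(T=100_000, T_local=1_000, d=128, rho=0.9)
ratio = sparse / full
```

Under these assumed values, labeling 90% of heads as local brings the cost down to about 11% of the dense cost, which shows that the saving is governed almost entirely by $\rho$ once $T_{\text{local}} \ll T$.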

Abstract

The ability to process long contexts is crucial for many natural language processing tasks, yet it remains a significant challenge. While substantial progress has been made in enhancing the efficiency of attention mechanisms, there is still a gap in understanding how attention heads function in long-context settings. In this paper, we observe that while certain heads consistently attend to local information only, others swing between attending to local and long-context information depending on the query. This raises the question: can we identify which heads require long-context information to predict the next token accurately? We demonstrate that it's possible to predict which heads are crucial for long-context processing using only local keys. The core idea here is to exploit a simple model for the long-context scores via second moment approximations. These findings unveil simple properties of attention in the context of long sequences, and open the door to potentially significant gains in efficiency.
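The idea of predicting long-context heads from local keys plus a second-moment model can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: it assumes the scaled scores of the non-local keys are approximately Gaussian with mean $q^\top \mu_K / \sqrt{d}$ and variance $q^\top \Sigma_K q / d$, so that the expected softmax mass of the non-local tokens has a closed form. The function names, the threshold `tau`, and the assumption that `mu_k` and `sigma_k` are supplied by the caller are all hypothetical.

```python
import numpy as np


def local_mass_estimate(q, local_keys, mu_k, sigma_k, n_bulk):
    """Estimate the fraction of softmax mass that falls on the local keys.

    Local scores are computed exactly. The n_bulk non-local scores are
    modeled as Gaussian (an assumption of this sketch), for which
    E[exp(s)] = exp(mean + variance / 2).
    """
    d = q.shape[0]
    scale = 1.0 / np.sqrt(d)
    local_scores = (local_keys @ q) * scale
    mean = float(mu_k @ q) * scale
    var = float(q @ sigma_k @ q) * scale**2
    # Log of the expected unnormalized mass of the non-local tokens.
    log_bulk = np.log(n_bulk) + mean + 0.5 * var
    # Log of the exact unnormalized mass of the local tokens.
    log_local = np.logaddexp.reduce(local_scores)
    # local / (local + bulk), computed in log space for stability.
    return float(np.exp(log_local - np.logaddexp(log_local, log_bulk)))


def is_local_head(q, local_keys, mu_k, sigma_k, n_bulk, tau=0.6):
    """Label the head as local for this query if the estimate reaches tau."""
    return local_mass_estimate(q, local_keys, mu_k, sigma_k, n_bulk) >= tau


# Toy inputs (hypothetical): the query aligns with two local keys.
q = np.array([4.0, 0.0, 0.0, 0.0])
local_keys = np.array([[4.0, 0.0, 0.0, 0.0], [4.0, 0.0, 0.0, 0.0]])
sigma_k = 0.01 * np.eye(4)

# Non-local keys centered at zero: the local window dominates.
local_case = is_local_head(q, local_keys, np.zeros(4), sigma_k, n_bulk=1000)
# Non-local keys centered on the query direction: long context dominates.
long_case = is_local_head(q, local_keys, q.copy(), sigma_k, n_bulk=1000)
```

The decision depends on the query through both the exact local scores and the projected moments, so the same head can be labeled local for one query and long-context for another. The cost per query is one pass over the local keys plus a quadratic form in $d$, independent of the number of non-local tokens.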
