Long-Context Generalization with Sparse Attention

Pavlo Vasylenko, Hugo Pitorro, André F. T. Martins, Marcos Treviso

TL;DR

This paper introduces Adaptive-Scalable Entmax (ASEntmax), which endows $\alpha$-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes.

Abstract

Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that dynamically sparse attention mechanisms using $\alpha$-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows $\alpha$-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Our empirical evaluation on synthetic tasks and language modeling demonstrates that ASEntmax substantially outperforms softmax, scalable softmax, and fixed-temperature $\alpha$-entmax baselines, achieving up to 1000$\times$ length extrapolation on synthetic benchmarks and superior long-context generalization on language modeling while preserving short-context performance, including better perplexity trends and higher retrieval accuracies at 8$\times$ training length.
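The dispersion effect described in the abstract can be seen in a toy setting. The sketch below is not the paper's implementation: it is a minimal pure-Python illustration that assumes one relevant token with score 3 among distractors with score 0, and it computes $\alpha$-entmax by plain bisection on the threshold. The names `entmax_bisect` and `signal_weight` are hypothetical helpers introduced only for this example.

```python
import math


def softmax(scores):
    """Standard softmax; every output is strictly positive."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entmax_bisect(scores, alpha=1.5, n_iter=60):
    """alpha-entmax for alpha > 1, found by bisection on the threshold tau.

    Solves sum_i max((alpha - 1) * s_i - tau, 0) ** (1 / (alpha - 1)) = 1.
    """
    if alpha <= 1.0:
        raise ValueError("this sketch only covers alpha > 1")
    scaled = [(alpha - 1.0) * s for s in scores]
    power = 1.0 / (alpha - 1.0)

    def mass(tau):
        return sum(max(v - tau, 0.0) ** power for v in scaled)

    # mass(lo) >= 1 and mass(hi) = 0, so the root lies in [lo, hi].
    lo, hi = max(scaled) - 1.0, max(scaled)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if mass(mid) >= 1.0:
            lo = mid
        else:
            hi = mid
    probs = [max(v - lo, 0.0) ** power for v in scaled]
    total = sum(probs)  # renormalize away the residual bisection error
    return [p / total for p in probs]


def signal_weight(attention_fn, n, signal_score=3.0):
    """Weight on one relevant token among n - 1 distractors with score 0."""
    scores = [signal_score] + [0.0] * (n - 1)
    return attention_fn(scores)[0]


if __name__ == "__main__":
    # Assumed toy lengths; the softmax weight shrinks as n grows.
    for n in (16, 256, 4096):
        print(n, signal_weight(softmax, n), signal_weight(entmax_bisect, n))
```

In this toy setup the distractors receive exactly zero under $\alpha = 1.5$ because the score gap is at least 2, which is enough for the single relevant token to absorb all the mass regardless of sequence length. Under softmax the same token's weight keeps shrinking as distractors are added.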

Paper Structure

This paper contains 77 sections, 10 theorems, 105 equations, 13 figures, and 17 tables.

Key Result

lemma 1

Consider scalars $a_1,...,a_{n-1},c\in\mathbb{R}$. Let $\bm{x} = [a_1,...,a_{n-1},c]^\top \in\mathbb{R}^n$ and $\bm{x}^* = [a_1,...,a_{n-1},b,c]^\top \in \mathbb{R}^{n+1}$, with all entries bounded. The following properties hold: Furthermore, for $\alpha > 1$, the difference $\alpha\text{-entmax}(\bm{x})_{n} - \alpha\text{-entmax}(\bm{x}^*)_{n+1}$ can take any value in $[0, \alpha\text{-entmax}(\
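The lemma statement above is cut off on this page, so the upper end of the interval is not reproduced here. The sketch below only illustrates the two behaviours the visible part describes, using $\alpha = 2$ (sparsemax) as an assumed representative of $\alpha > 1$ because it has a simple sort-based solution. The helper `drop_in_last_weight` and the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Standard softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """alpha-entmax with alpha = 2, via the standard sort-based threshold."""
    ordered = sorted(scores, reverse=True)
    running, tau = 0.0, 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if 1.0 + k * value > running:
            tau = (running - 1.0) / k  # threshold from the top-k scores
        else:
            break  # the support condition fails for all later entries too
    return [max(s - tau, 0.0) for s in scores]


def drop_in_last_weight(attention_fn, a, b, c):
    """Weight of c in x = [a..., c] minus its weight in x* = [a..., b, c]."""
    before = attention_fn(a + [c])[-1]
    after = attention_fn(a + [b, c])[-1]
    return before - after


if __name__ == "__main__":
    a, c = [0.5, 0.3], 1.0
    for b in (-1.0, 0.9):
        print(
            b,
            drop_in_last_weight(sparsemax, a, b, c),
            drop_in_last_weight(softmax, a, b, c),
        )
```

With a low-scoring insertion ($b = -1$) the new token falls below the sparsemax threshold, so the weight on $c$ is unchanged and the difference is zero. With a competitive insertion ($b = 0.9$) the difference is positive. Softmax loses weight on $c$ in both cases, since every inserted token receives nonzero mass.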

Figures (13)

  • Figure 1: Long-context generalization on Multi-query Multi-token Associative Recall (left) and Max Retrieval (right). SSMax represents the Scalable Softmax approach of Nakanishi (2025), and Adaptive Temperature (Adapt. Temp.) represents the approach of Veličković et al. (2024). While all methods benefit from using NAPE (NoPE + ALiBi), our adaptive-scaling version of $\alpha$-entmax exhibits the best extrapolation results, effectively handling extremely long sequences.
  • Figure 2: Visualization of $\alpha\text{-entmax}(\bm{z}/\theta)$ for different values of $\alpha$. Each panel shows how probability mass is distributed among five elements of $\bm{z} = [2.0, 1.8, 1.6, 1.4, 1.2]$ as the temperature parameter decreases ($\theta^{-1}$ increases). The vertical lines show the temperature that leads to zero probability.
  • Figure 3: Learned positions per head. Besides a simple linear fit baseline ($\beta n$), we also show the fit given by $\delta + \beta \log n$ and $\delta + \beta (\log n)^\gamma$, which are used by SSMax and ASEntmax, respectively. These plots further reinforce the idea that having a $\gamma$ parameter is beneficial for length extrapolation.
  • Figure 4: Example of attention weight profiles for different positional encodings with $\alpha$-entmax with $\alpha=1.5$. NoPE induces content-driven sparsity. ALiBi induces attention windows with a clear cutoff. RoPE promotes frequency-dependent patterns with potential periodic dead zones.
  • Figure 5: Visualization of $\alpha$-entmax for different values of $\alpha$. We also include top-$k$ softmax with $k=2$ for completeness. Each panel shows how $p_0$ varies for the input $\bm{z} = [0, z_1, z_2]$.
  • ...and 8 more figures
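
For the special case $\alpha = 2$ (sparsemax), the zero-probability temperatures that Figure 2 marks with vertical lines can be worked out by hand. This is a sketch that assumes the standard sort-based support condition for sparsemax. Other values of $\alpha$ need a different threshold equation, so the numbers below do not carry over to them, and whether the figure includes an $\alpha = 2$ panel is not stated here. Writing $z_{(1)} \ge z_{(2)} \ge \dots$ for the sorted entries:

$$
z_{(k)} \in \operatorname{supp}\bigl(\text{sparsemax}(\bm{z}/\theta)\bigr)
\iff 1 + k\,\frac{z_{(k)}}{\theta} > \sum_{i \le k} \frac{z_{(i)}}{\theta}
\iff \theta > \theta_k := \sum_{i \le k} \bigl(z_{(i)} - z_{(k)}\bigr).
$$

For $\bm{z} = [2.0, 1.8, 1.6, 1.4, 1.2]$ this gives

$$
\theta_2 = 0.2, \qquad \theta_3 = 0.4 + 0.2 = 0.6, \qquad \theta_4 = 0.6 + 0.4 + 0.2 = 1.2, \qquad \theta_5 = 0.8 + 0.6 + 0.4 + 0.2 = 2.0,
$$

so under these assumptions the smallest element reaches exactly zero once $\theta^{-1} \ge 0.5$, the next once $\theta^{-1} \ge 1/1.2 \approx 0.83$, then $\theta^{-1} \ge 1/0.6 \approx 1.67$, and the second-largest once $\theta^{-1} \ge 5$.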

Theorems & Definitions (20)

  • lemma 1: Non-Vanishing Attention Property
  • definition 1: Attention Dispersion
  • proposition 1: Dispersion Properties of Attention Mechanisms
  • proposition 2: Representational Preservation and Reduced Gradient Paths
  • lemma 2: Threshold Behavior for Two-level Logits
  • proof
  • proof
  • proposition 3: Counterexample to Representational Collapse with $\alpha$-entmax
  • proof
  • proposition 4: Over-squashing Alleviation with $\alpha$-entmax
  • ...and 10 more