LUNA: Linear Universal Neural Attention with Generalization Guarantees

Ashkan Shahbazi, Ping He, Ali Abbasi, Yikun Bai, Xinran Liu, Elaheh Akbari, Darian Salehi, Navid NaderiAlizadeh, Soheil Kolouri

TL;DR

The paper tackles the quadratic cost of softmax attention by introducing LUNA, a linear-time attention mechanism built around a fully learnable kernel feature map. By learning input projections, a bank of channel functions, and a tokenwise envelope, LUNA preserves the linear compute pattern while adapting to data-specific inductive biases; it also supports effective post-hoc conversion from pretrained quadratic models. The authors provide a theoretical framework including a Rademacher-based bound and an error decomposition between parametrization and sampling, and they substantiate their approach with state-of-the-art results on Long Range Arena under compute parity and strong post-hoc conversion performance on BERT/GLUE and ViT/ImageNet-1K. Overall, LUNA enables accurate, scalable attention for long sequences and practical deployment in existing systems through minimal fine-tuning after conversion.

Abstract

Scaling attention faces a critical bottleneck: the $\mathcal{O}(n^2)$ quadratic computational cost of softmax attention, which limits its application in long-sequence domains. While linear attention mechanisms reduce this cost to $\mathcal{O}(n)$, they typically rely on fixed random feature maps, such as random Fourier features or hand-crafted functions. This reliance on static, data-agnostic kernels creates a fundamental trade-off, forcing practitioners to sacrifice significant model accuracy for computational efficiency. We introduce LUNA, a kernelized linear attention mechanism that eliminates this trade-off, retaining linear cost while matching and surpassing the accuracy of quadratic attention. LUNA is built on the key insight that the kernel feature map itself should be learned rather than fixed a priori. By parameterizing the kernel, LUNA learns a feature basis tailored to the specific data and task, overcoming the expressive limitations of fixed-feature methods. LUNA implements this with a learnable feature map that induces a positive-definite kernel and admits a streaming form, yielding linear time and memory scaling in the sequence length. Empirical evaluations validate our approach across diverse settings. On the Long Range Arena (LRA), LUNA achieves state-of-the-art average accuracy among efficient Transformers under compute parity, using the same parameter count, training steps, and approximate FLOPs. LUNA also excels at post-hoc conversion: replacing softmax in fine-tuned BERT and ViT-B/16 checkpoints and briefly fine-tuning recovers most of the original performance, substantially outperforming fixed linearizations.
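
The linear cost and the streaming form mentioned in the abstract both follow from reassociating the kernelized attention product. The sketch below illustrates this with NumPy under stated assumptions: `feature_map` is a fixed placeholder (elu(x) + 1), not LUNA's learned map, and all function names are hypothetical. It is meant only to show that the explicit $n \times n$ computation and the reassociated one agree.

```python
import numpy as np

def feature_map(x):
    # Placeholder positive feature map, elu(x) + 1. Assumption for
    # illustration only: LUNA learns this map instead of fixing it.
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_kernel_attention(Q, K, V):
    # Builds the full n x n kernel matrix: O(n^2) time and memory.
    A = feature_map(Q) @ feature_map(K).T
    return (A @ V) / A.sum(axis=1, keepdims=True)

def linear_kernel_attention(Q, K, V):
    # Reassociates the product so that no n x n matrix is ever formed.
    phi_q, phi_k = feature_map(Q), feature_map(K)
    S = phi_k.T @ V          # (r, d_v) summary of keys and values
    z = phi_k.sum(axis=0)    # (r,) summary used for normalization
    return (phi_q @ S) / (phi_q @ z)[:, None]

def streaming_causal_attention(Q, K, V):
    # Causal variant: the two summaries are updated one token at a time.
    phi_q, phi_k = feature_map(Q), feature_map(K)
    S = np.zeros((phi_k.shape[1], V.shape[1]))
    z = np.zeros(phi_k.shape[1])
    out = np.empty_like(V, dtype=float)
    for t in range(Q.shape[0]):
        S += np.outer(phi_k[t], V[t])
        z += phi_k[t]
        out[t] = (phi_q[t] @ S) / (phi_q[t] @ z)
    return out

rng = np.random.default_rng(0)
n, d = 16, 4
Q, K, V = rng.normal(size=(3, n, d))
print(np.allclose(quadratic_kernel_attention(Q, K, V),
                  linear_kernel_attention(Q, K, V)))
```

The state carried by the streaming loop has size independent of $n$, which is where linear time and memory in the sequence length come from.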

Paper Structure

This paper contains 44 sections, 27 theorems, 118 equations, 4 figures, 5 tables.

Key Result

Proposition 1

The construction in (eq:kernel) yields a positive-definite kernel. Conversely, by Mercer’s theorem, any positive-definite kernel admits such a representation for an appropriate $\phi(\cdot;\omega)$ into an RKHS $\mathcal{H}$. See Appendix sec:kernel_intro.
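
The first claim can be sketched in one line. The equation itself is not reproduced on this page, so assume (as a reading, not a quotation) that the construction has the usual inner-product form $k(x,y) = \mathbb{E}_{\omega}\big[\langle \phi(x;\omega), \phi(y;\omega) \rangle_{\mathcal{H}}\big]$. Symmetry is immediate, and for any points $x_1,\dots,x_N$ and coefficients $c_1,\dots,c_N \in \mathbb{R}$:

$$\sum_{i=1}^{N}\sum_{j=1}^{N} c_i c_j\, k(x_i, x_j) = \mathbb{E}_{\omega}\Big[\Big\| \sum_{i=1}^{N} c_i\, \phi(x_i;\omega) \Big\|_{\mathcal{H}}^{2}\Big] \ge 0 .$$

Under that assumed form, every Gram matrix of $k$ is symmetric positive semi-definite, which is the standard meaning of a positive-definite kernel.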

Figures (4)

  • Figure 1: (a) Softmax attention requires computing all pairwise interactions among tokens, which causes the cost to grow quadratically with the sequence length. (b) LUNA introduces a learnable kernel method for linearizing the attention mechanism, shifting the expensive step from the sequence length $n$ to the feature map size $mL$. (c) For a given set of tokens, LUNA applies $m$ linear projections $W_i \in \mathbb{R}^{d}$, producing $m$ scalar values. Each scalar is then passed through a shared MLP $\psi \colon \mathbb{R} \rightarrow \mathbb{R}^{L}$. By concatenating the $L$ outputs across all $m$ projections, we obtain the kernel feature map $\phi \in \mathbb{R}^{mL}$. The plots on the right show several learned components $\psi_i$ for the LRA-Text task, illustrating that the resulting scalar functions differ from the commonly used fixed choices such as $\tanh$, $\sin$, or $\exp$. The gray band/histogram represents the empirical distribution of $u$.
  • Figure 2: Scaling of linear attention variants. Per-layer runtime as a function of sequence length $n$ (log–log scale) for representative linear-attention baselines and LUNA, measured under identical settings. All methods exhibit approximately linear growth in $n$, with LUNA matching the runtime envelope of existing baselines while using learnable kernel features.
  • Figure 3: Channel-wise visualization of the learned feature map $\phi$ on LRA-Image with $M=8$ and $L=8$. Each small subplot shows a channel MLP $\psi_\ell(u)$ versus the scalar projection $u = w_i^\top x + b_i$. Top row: Transformer layer 0; bottom row: Transformer layer 1; orange curve: $\psi_\ell(u)$ evaluated on real $u$; gray histogram: the empirical distribution of $u$.
  • Figure 4: CLS-based attention visualizations across vision and language. Top row (ViT-B/16, ImageNet dog sample): CLS attention on images. Columns: Original, Softmax rollout, Hedgehog (diffuse baseline), LUNA (ours). LUNA concentrates on semantically coherent regions (eyes/snout/collar) and suppresses background, while Softmax is scattered and Hedgehog is overly smooth. Bottom row (BERT on SST-2): CLS$\rightarrow$token attention on SST-2. Columns: Original text, Softmax, Hedgehog, LUNA. Tokens are shaded with an alpha proportional to the normalized attention weight from the CLS token (red for Softmax, yellow for Hedgehog, blue for LUNA). LUNA yields compact, sentiment-aligned highlights comparable to Softmax and sharper than Hedgehog.
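
The feature-map construction described in the Figure 1(c) and Figure 3 captions can be sketched as follows. This is a minimal NumPy illustration with random, untrained weights, not the authors' implementation: the hidden width, the `tanh` activation inside the shared MLP, the initialization, and all names (`init_params`, `feature_map`) are assumptions, and the tokenwise envelope mentioned in the TL;DR is omitted.

```python
import numpy as np

def init_params(d, m, L, hidden=16, seed=0):
    # Hypothetical initialization; the shapes follow the figure captions.
    rng = np.random.default_rng(seed)
    return {
        "W": rng.normal(size=(m, d)) / np.sqrt(d),   # m projections w_i in R^d
        "b": np.zeros(m),                            # offsets b_i
        "A1": rng.normal(size=(1, hidden)),          # shared MLP psi: R -> R^L
        "c1": np.zeros(hidden),
        "A2": rng.normal(size=(hidden, L)) / np.sqrt(hidden),
        "c2": np.zeros(L),
    }

def feature_map(X, p):
    # u[t, i] = w_i^T x_t + b_i: one scalar per token and projection.
    u = X @ p["W"].T + p["b"]                            # (n, m)
    # The same MLP psi is applied to every scalar u[t, i].
    h = np.tanh(u[..., None] * p["A1"][0] + p["c1"])     # (n, m, hidden)
    psi = h @ p["A2"] + p["c2"]                          # (n, m, L)
    # Concatenate the L outputs across all m projections.
    return psi.reshape(X.shape[0], -1)                   # (n, m * L)

n, d, m, L = 10, 6, 4, 8
X = np.random.default_rng(1).normal(size=(n, d))
params = init_params(d, m, L)
Phi = feature_map(X, params)
gram = Phi @ Phi.T   # kernel values k(x_s, x_t) = <phi(x_s), phi(x_t)>
print(Phi.shape, np.linalg.eigvalsh(gram).min())
```

Because the kernel is an inner product of features, the Gram matrix is positive semi-definite up to floating-point error regardless of the weights, and the per-token cost depends on the feature size $mL$ rather than on the sequence length $n$.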

Theorems & Definitions (54)

  • Proposition 1
  • Remark 1: Neural Approximation of Multiplicatively Decomposable Kernels
  • Proposition 2: Parametrization Error (Informal)
  • Proof: Sketch of the proof of Proposition 2
  • Proposition 3: Sampling Error (Informal)
  • Proof: Sketch of the proof of Proposition 3
  • Remark 2
  • Lemma 1
  • Remark 3
  • proof
  • ...and 44 more