SLAY: Geometry-Aware Spherical Linearized Attention with Yat-Kernel

Jose Miguel Luna, Taha Bouhsine, Krzysztof Choromanski

TL;DR

SLAY tackles the quadratic time and memory bottleneck of softmax attention by constraining queries and keys to unit norm and re-expressing the Yat-kernel as a positive mixture of polynomial and exponential kernels via Bernstein's integral. The method uses Gauss–Laguerre quadrature and positive random features to obtain a strictly positive, tractable approximation, enabling $O(L)$ attention with stable normalization and bounded scores. The paper provides theoretical guarantees (positivity, boundedness, and positive definiteness) and empirical evidence across synthetic tasks, extreme classification, and full Transformer models, showing that SLAY performs close to full softmax attention while retaining linear time and memory scaling and outperforming prior linear-time approaches. Practically, SLAY enables scalable long-context Transformers with geometry-aware interactions, reducing computational cost with little loss in accuracy, and reports gains on both language and vision-like benchmarks.

Abstract

We propose a new class of linear-time attention mechanisms based on a relaxed and computationally efficient formulation of the recently introduced E-Product, often referred to as the Yat-kernel (Bouhsine, 2025). The resulting interactions are geometry-aware and inspired by inverse-square interactions in physics. Our method, Spherical Linearized Attention with Yat Kernels (SLAY), constrains queries and keys to the unit sphere so that attention depends only on angular alignment. Using Bernstein's theorem, we express the spherical Yat-kernel as a nonnegative mixture of polynomial-exponential product kernels and derive a strictly positive random-feature approximation enabling linear-time O(L) attention. We establish positive definiteness and boundedness on the sphere and show that the estimator yields well-defined, nonnegative attention scores. Empirically, SLAY achieves performance that is nearly indistinguishable from standard softmax attention while retaining linear time and memory scaling, and consistently outperforms prior linear-time attention mechanisms such as Performers and Cosformers. To the best of our knowledge, SLAY represents the closest linear-time approximation to softmax attention reported to date, enabling scalable Transformers without the typical performance trade-offs of attention linearization.
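
To make the linear-time claim concrete, the following is a minimal sketch of how any strictly positive feature map turns attention into an $O(L)$ computation by reordering the matrix products. It is not the SLAY feature map: the function names are hypothetical, and the feature map used here is the Performer-style positive random feature for the exponential kernel $e^{q^\top k}$, chosen only because it is strictly positive. The sketch checks that the linear-time ordering returns the same output as explicitly forming the $L \times L$ score matrix.

```python
import numpy as np

def positive_random_features(x, omega):
    """Performer-style positive random features for exp(q.k) (illustrative only).

    x:     (L, d) array whose rows are unit-norm queries or keys
    omega: (m, d) array of i.i.d. standard normal projection vectors
    """
    m = omega.shape[0]
    sq_norm = np.sum(x * x, axis=-1, keepdims=True)  # equals 1 on the unit sphere
    return np.exp(x @ omega.T - 0.5 * sq_norm) / np.sqrt(m)  # strictly positive

def linear_attention(phi_q, phi_k, v):
    """O(L) ordering: the L x L score matrix is never formed."""
    kv = phi_k.T @ v              # (m, d_v), summed over keys once
    z = phi_k.sum(axis=0)         # (m,), used for the normalizer
    return (phi_q @ kv) / (phi_q @ z)[:, None]

def quadratic_attention(phi_q, phi_k, v):
    """Reference O(L^2) ordering with an explicit score matrix."""
    scores = phi_q @ phi_k.T      # (L, L), all entries positive
    return (scores / scores.sum(axis=1, keepdims=True)) @ v

rng = np.random.default_rng(0)
L, d, m = 64, 8, 256              # sequence length, head dim, feature count (assumed values)
q = rng.standard_normal((L, d))
q /= np.linalg.norm(q, axis=1, keepdims=True)   # project queries onto the sphere
k = rng.standard_normal((L, d))
k /= np.linalg.norm(k, axis=1, keepdims=True)   # project keys onto the sphere
v = rng.standard_normal((L, d))
omega = rng.standard_normal((m, d))

phi_q = positive_random_features(q, omega)
phi_k = positive_random_features(k, omega)
out_linear = linear_attention(phi_q, phi_k, v)
out_quadratic = quadratic_attention(phi_q, phi_k, v)
```

Because every feature is positive, the normalizer `phi_q @ z` cannot vanish or change sign, which is the property the abstract refers to as well-defined, nonnegative attention scores.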

Paper Structure (73 sections, 8 theorems, 44 equations, 21 figures, 9 tables, 1 algorithm)

Key Result

Lemma 1

For $\epsilon > 0$ and $C = 2 + \epsilon$, the variable $y = C - 2x$ satisfies $y \geq \epsilon > 0$ for all $x \in [-1,1]$. Hence Bernstein's representation $1/(C - 2x) = \int_0^\infty e^{-s(C-2x)}\,ds$ applies throughout the domain.
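
The identity in Lemma 1 can be checked numerically. The sketch below is an illustration under stated assumptions rather than the paper's procedure: the function name is hypothetical, and the substitution $s = t/C$ is one convenient way to match the Gauss–Laguerre weight $e^{-t}$, giving $1/(C - 2x) = \frac{1}{C}\int_0^\infty e^{-t}\, e^{2xt/C}\,dt$. Each quadrature term is a positive weight times an exponential in $x$, which mirrors the positive-mixture structure described in the abstract.

```python
import numpy as np

def inverse_via_gauss_laguerre(x, eps=1.0, n_nodes=32):
    """Approximate 1/(C - 2x), C = 2 + eps, from Bernstein's integral.

    Substituting s = t / C in  int_0^inf exp(-s (C - 2x)) ds  gives
        (1/C) * int_0^inf exp(-t) * exp(2 x t / C) dt,
    which has the Gauss-Laguerre weight exp(-t).
    """
    C = 2.0 + eps
    t, w = np.polynomial.laguerre.laggauss(n_nodes)   # nodes and weights
    x = np.asarray(x, dtype=float)
    terms = w * np.exp(2.0 * x[..., None] * t / C)    # positive weight * exponential in x
    return terms.sum(axis=-1) / C

eps = 1.0                                  # assumed value for illustration
x = np.linspace(-1.0, 1.0, 5)              # alignments spanning the full domain
approx = inverse_via_gauss_laguerre(x, eps)
exact = 1.0 / (2.0 + eps - 2.0 * x)
max_abs_error = np.max(np.abs(approx - exact))
print(max_abs_error)
```

With $\epsilon = 1$ the growth rate $2x/C$ stays at most $2/3$ in magnitude, so a few dozen nodes suffice. A smaller $\epsilon$ pushes $2/C$ toward 1, and convergence near $x = 1$ slows accordingly.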

Figures (21)

  • Figure 1: Each panel shows how a kernel partitions the 2D feature space among 5 randomly placed neurons (stars). (a) Linear dot product softmax. (b) FAVOR+ (ReLU random features). (c) Linear with ELU+1. (d) Exact ⵟ-kernel. (e) Spherical ⵟ-kernel. (f) SLAY (Anchor).
  • Figure 2: Scaling behavior of attention mechanisms. The variants compared are brute-force regular attention (Standard), YAT-attention (YAT), linear attention (ELU+1), Cosformer attention, Performer attention (FAVOR+), and SLAY attention.
  • Figure 3: Validation loss (top) and validation perplexity (bottom) as a function of training steps for a 125M-parameter model trained on 2.5B tokens, a budget chosen to satisfy the Chinchilla scaling law.
  • Figure 4: Kernel response as a function of alignment $x=\hat{q}^\top\hat{k}$. The spherical ⵟ-kernel is bounded and self-regularizing, unlike softmax, which grows without bound.
  • Figure 5: Kernel response vs. angular distance.
  • ...and 16 more figures

Theorems & Definitions (18)

  • Remark 1: Feature Map Target
  • Remark 2: Bias Decomposition
  • Lemma 1: Bernstein Representation Applicability
  • proof
  • Proposition 1: Geometric Origin
  • Remark 3: Geometric Invariances
  • Proposition 2: PRF Unbiasedness
  • proof
  • Lemma 2: Positive Mixture Closure
  • Theorem 1: Tensor Kernel Decomposition
  • ...and 8 more