Fractional neural attention for efficient multiscale sequence processing

Cheng Kevin Qu, Andrew Ly, Pulin Gong

TL;DR

This paper introduces Fractional Neural Attention (FNA), a principled mechanism that connects self-attention, stochastic dynamics, and geometry, and provides an interpretable, biologically grounded foundation for powerful, neuroscience-inspired AI.

Abstract

Attention mechanisms underpin the computational power of Transformer models, which have achieved remarkable success across diverse domains. Yet understanding and extending the principles underlying self-attention remains a key challenge for advancing artificial intelligence. Drawing inspiration from the multiscale dynamics of biological attention and from dynamical systems theory, we introduce Fractional Neural Attention (FNA), a principled, neuroscience-inspired framework for multiscale information processing. FNA models token interactions through Lévy diffusion governed by the fractional Laplacian, intrinsically realizing simultaneous short- and long-range dependencies across multiple scales. This mechanism yields greater expressivity and faster information mixing, advancing the foundational capacity of Transformers. Theoretically, we show that FNA's dynamics are governed by the fractional diffusion equation, and that the resulting attention networks exhibit larger spectral gaps and shorter path lengths -- mechanistic signatures of enhanced computational efficiency. Empirically, FNA achieves competitive text-classification performance even with a single layer and a single head; it also improves performance in image processing and neural machine translation. Finally, the diffusion map algorithm from geometric harmonics enables dimensionality reduction of FNA weights while preserving the intrinsic structure of embeddings and hidden states. Together, these results establish FNA as a principled mechanism connecting self-attention, stochastic dynamics, and geometry, providing an interpretable, biologically grounded foundation for powerful, neuroscience-inspired AI.
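
The contrast between heavy-tailed (multiscale) and Gaussian (local) token interactions described in the abstract can be illustrated with a small numerical sketch. This is not the paper's implementation: the exact FNA score is not given in this summary, so the code assumes a regularized power-law kernel $(1 + \|q - k\|^2)^{-(d + \alpha)/2}$, chosen only because its tail decays like the $|x|^{-(d + \alpha)}$ tail of an $\alpha$-stable Lévy step distribution. The function names and the toy embedding are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(Q, K):
    """Squared Euclidean distance between every query row and every key row."""
    diff = Q[:, None, :] - K[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def heavy_tailed_attention(Q, K, alpha=1.2):
    """Row-normalized power-law kernel with tail exponent d + alpha.

    ASSUMPTION: this kernel form is illustrative and is not taken from the paper.
    """
    d = Q.shape[-1]
    scores = (1.0 + pairwise_sq_dists(Q, K)) ** (-(d + alpha) / 2.0)
    return scores / scores.sum(axis=1, keepdims=True)


def gaussian_attention(Q, K):
    """Row-normalized Gaussian kernel, standing in for the local alpha = 2 case."""
    scores = np.exp(-0.5 * pairwise_sq_dists(Q, K))
    return scores / scores.sum(axis=1, keepdims=True)


# Toy embedding: 8 tokens placed one unit apart on a line in 2-D.
X = np.stack([np.arange(8.0), np.zeros(8)], axis=1)

A_frac = heavy_tailed_attention(X, X, alpha=1.2)
A_gauss = gaussian_attention(X, X)

# Weight that the first token places on the last token under each kernel.
print("heavy-tailed:", A_frac[0, -1])
print("gaussian:    ", A_gauss[0, -1])
```

In this toy setting the heavy-tailed kernel leaves a non-negligible weight between the first and last tokens, while the Gaussian weight is many orders of magnitude smaller. That is the qualitative behavior the abstract attributes to Lévy diffusion.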

Paper Structure

This paper contains 6 sections, 3 theorems, 45 equations, 12 figures, 3 tables.

Key Result

Proposition 1

Assume $\mathcal{M} = \mathbb{R}^d$ or $\mathbb{S}^{d - 1}$, single-head attention $H = 1$ and $\mathbf{W}_{Q,K} \in O(d)$ (i.e., Assumption 1). Then, self-attention (equation eq:sa_resnet) and FNA (equation eq:frac_attn_score) with query and key projection matrices $\mathbf{W}_Q$ and $\mathbf{W}_K$

Figures (12)

  • Figure 1: From fractional diffusion to attention. The fractional diffusion equation describes the density evolution of Lévy processes for $\alpha < 2$. When $\alpha = 2$, it reduces to the classical diffusion equation corresponding to Brownian motion. The associated step-size distributions are heavy-tailed (i.e., multiscale) for $\alpha < 2$ and Gaussian for $\alpha = 2$. Inspired by the role of fractional diffusion in neurobiological attention [Chen2022], we incorporate these dynamics into Transformer self-attention to obtain fractional neural attention (FNA). Multiscale FNA ($\alpha < 2$) enhances the expressivity of attention, enabling the first word of a sentence (e.g., "All children, except one, grow up." from Peter Pan) to attend to the last with a single "step" of the attention mechanism. In contrast, local attention ($\alpha = 2$) and standard attention typically require multiple steps. Steps are represented by arrows within the attention matrices.
  • Figure 2: FNA token interactions and spectral characteristics. (a) Solid lines show eigenvalues of the FNA attention weight matrix, ordered from small to large. Dashed lines are guides to the eye with slopes of $j^{1.2}$ (blue) and $j^2$ (red), respectively, where $j$ is the eigenvalue index. (b) The gray-shaded circles represent the randomly sampled token embeddings. The thickness of the blue lines reflects the strength of the fractional attention weights between connected queries and keys, scaled by a constant. Only connection strengths above $3.12 \times 10^{-5}$ are shown. (c) Same as in (b) but for $\alpha = 2$, with the red lines representing the attention weight strengths.
  • Figure 3: Effects of embedding dimension and depth. (a) Accuracy on the testing dataset across embedding dimensions $d$ and depth $L = 1$ with $\mathbf{Q} \neq \mathbf{K}$. Dots and error bars show the mean and standard deviation across five trials. (b) Same as (a), using $\mathbf{Q} = \mathbf{K}$.
  • Figure 4: Ablation of nodes in the attention graph. Accuracy on the testing dataset after randomly ablating nodes from the attention graph with probability $p$. Lines show the mean across five trials for each network with $\mathbf{Q} = \mathbf{K}$. Shaded regions indicate the standard deviation. (a) $d=8$. (b) $d=16$. (c) $d=32$. (d) $d=64$.
  • Figure 5: Graph-theoretic analysis of fractional neural attention. (a) Mean test accuracy across five trials versus the mean spectral gap computed on a random IMDb subset of size 100. Error bars show the standard deviation. (b) Diffusion map visualization with $m = 2$. (c) Attention between two related tokens: "much" and "danger". The shortest paths for DP, $\alpha=2$ and $\alpha=1.2$ are 13, 2 and 1, respectively. Edge widths are proportional to the reciprocal of the attention weight. (d) Shortest-path lengths between every pair of tokens in the same sequence for multiscale FNA ($\alpha = 1.2$), local attention ($\alpha = 2$) and DP. (e) Dots show the mean shortest-path length between each pair of tokens in a random sequence of a given length. Error bars represent 0.2 standard deviations (scaled for clarity).
  • ...and 7 more figures
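
The two graph-theoretic quantities that appear in the Figure 2 and Figure 5 captions, the spectral gap and the shortest-path length on the attention graph, can be sketched on a toy example. The definitions here are assumptions rather than the paper's: the spectral gap is taken as $1 - |\lambda_2|$ of the row-stochastic attention matrix, and path length is the hop count over edges whose weight exceeds a hypothetical threshold. The toy attention matrices use an illustrative power-law kernel on tokens spaced along a line, not the actual FNA score.

```python
from collections import deque

import numpy as np


def toy_attention(n, alpha=None):
    """Row-stochastic attention over n tokens on a line (illustrative only).

    alpha=None gives a Gaussian kernel; otherwise an ASSUMED power-law kernel
    with tail exponent 1 + alpha (embedding dimension d = 1).
    """
    pos = np.arange(n, dtype=float)
    sq = (pos[:, None] - pos[None, :]) ** 2
    if alpha is None:
        scores = np.exp(-0.5 * sq)
    else:
        scores = (1.0 + sq) ** (-(1.0 + alpha) / 2.0)
    return scores / scores.sum(axis=1, keepdims=True)


def spectral_gap(A):
    """1 - |lambda_2|, with eigenvalues sorted by decreasing magnitude."""
    mags = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    return 1.0 - mags[1]


def hop_distance(A, src, dst, threshold):
    """Breadth-first hop count from src to dst over edges with A[i, j] > threshold."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        i = queue.popleft()
        if i == dst:
            return dist[i]
        for j in range(A.shape[0]):
            if j not in dist and A[i, j] > threshold:
                dist[j] = dist[i] + 1
                queue.append(j)
    return None  # dst is unreachable at this threshold


n = 16
threshold = 5e-4  # hypothetical cutoff for keeping an edge

A_frac = toy_attention(n, alpha=1.2)
A_gauss = toy_attention(n)

gap_frac, gap_gauss = spectral_gap(A_frac), spectral_gap(A_gauss)
hop_frac = hop_distance(A_frac, 0, n - 1, threshold)
hop_gauss = hop_distance(A_gauss, 0, n - 1, threshold)

print("spectral gap  (heavy-tailed, gaussian):", gap_frac, gap_gauss)
print("hops 0 -> n-1 (heavy-tailed, gaussian):", hop_frac, hop_gauss)
```

On this toy graph the heavy-tailed matrix links the first and last tokens directly and has the larger spectral gap, which matches the qualitative trend the captions report. The numbers themselves depend on the assumed kernel and threshold.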

Theorems & Definitions (5)

  • Proposition 1
  • Theorem 2: PDE for FNA
  • proof
  • Lemma 1
  • proof