Fast Multipole Attention: A Scalable Multilevel Attention Mechanism for Text and Images

Yanming Kang, Giang Tran, Hans De Sterck

TL;DR

The paper tackles the quadratic bottleneck of self-attention by introducing Fast Multipole Attention (FMA), a multilevel, physics-inspired mechanism that blends near-field exact interactions with far-field learned aggregations. By organizing attention in a hierarchical 1D (and 2D) structure, FMA achieves O(n log n) complexity (or O(n) with query downsampling) while preserving global context for long sequences and high-resolution images. Empirically, FMA outperforms strong efficient-attention baselines on autoregressive and bidirectional language benchmarks and surpasses vision transformers on ImageNet-1K and ADE20K while using comparable or lower memory. The approach is implemented with GPU-optimized kernels and opens avenues for 3D extensions and hybrid architectures.

Abstract

While Transformer networks benefit from a global receptive field, their quadratic cost relative to sequence length restricts their application to long sequences and high-resolution inputs. We introduce Fast Multipole Attention (FMA), a divide-and-conquer mechanism for self-attention inspired by the Fast Multipole Method from n-body physics. FMA reduces the time and memory complexity of self-attention from $\mathcal{O}\left(n^2\right)$ to $\mathcal{O}(n \log n)$ and $\mathcal{O}(n)$ while preserving full-context interactions. FMA contains a learned hierarchy with $\mathcal{O}(\log n)$ levels of resolution. In this hierarchy, nearby tokens interact at full resolution, while distant tokens engage through progressively coarser, learned basis functions. We have developed both 1D and 2D implementations of FMA for language and vision tasks, respectively. On autoregressive and bidirectional language modeling benchmarks, the 1D variant either matches or outperforms leading efficient attention baselines with substantially lower memory use. With linear complexity, the 2D variant demonstrates superior performance over strong vision transformer baselines in classification and semantic segmentation tasks. Our results confirm that the multilevel attention implemented by FMA allows Transformer-based models to scale to much longer sequences and higher-resolution inputs without loss in accuracy. This provides a principled, physics-inspired approach for developing scalable neural networks suitable for language, vision, and multimodal tasks. Our code will be available at https://github.com/epoch98/FMA.
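
To make the complexity claim concrete, the following minimal sketch counts how many key-side interactions a single query needs under a simplified multilevel scheme. It is an illustration under stated assumptions, not the authors' implementation: the block size `r`, the rule that block size doubles at each level, and the helper name `interactions_per_query` are all introduced here. The near field is taken to be the query's own block of `r` tokens plus its two neighbouring blocks, and each far-field group is counted as one aggregated interaction.

```python
def interactions_per_query(n: int, r: int, i: int) -> int:
    """Count key-side interactions for query i (hypothetical helper).

    Assumed scheme: the near field is the query's block of r tokens plus
    its two neighbours (exact attention); at each coarser level the block
    size doubles, and every newly covered group counts as ONE interaction.
    """
    blocks = n // r
    if n % r or blocks < 1 or blocks & (blocks - 1):
        raise ValueError("n must equal r times a power of two")
    if not 0 <= i < n:
        raise ValueError("query index out of range")

    block = r
    q = i // block
    lo, hi = max(0, (q - 1) * block), min(n, (q + 2) * block)
    count = hi - lo  # near field: one interaction per token

    # Widen the covered range level by level until it spans the sequence.
    while lo > 0 or hi < n:
        group = block   # far-field groups have the previous block size
        block *= 2
        q = i // block
        new_lo, new_hi = max(0, (q - 1) * block), min(n, (q + 2) * block)
        # Newly covered tokens on each side, summarized group by group.
        count += (lo - new_lo) // group + (new_hi - hi) // group
        lo, hi = new_lo, new_hi
    return count


def total_interactions(n: int, r: int) -> int:
    """Sum the per-query counts over the whole sequence."""
    return sum(interactions_per_query(n, r, i) for i in range(n))
```

Under these assumptions, the first query of a sequence with `n = 1024` and `r = 4` needs 22 interactions, compared with 1024 for full attention. Each query touches at most `3r` exact tokens plus at most three groups per level, which is where the overall $\mathcal{O}(n \log n)$ count comes from when `r` is fixed.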

Paper Structure (23 sections, 22 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Conceptual view of 1D FMA. Top: full attention computes all $n$ pairwise interactions for query token $i$. Bottom: FMA keeps exact attention in a $3r$ window (white cells); tokens beyond that window are first merged into progressively larger groups (in gray shades), and a single aggregated interaction is computed per group.
  • Figure 2: FMA 1D sparsified attention matrix. White cells are exact (near field). Gray cells are far field. Darker gray means coarser resolution.
  • Figure 3: FMA 2D hierarchy and a corresponding sparsified attention matrix. White cells are exact (near field); gray cells are far field. Darker gray means coarser resolution. Panel (b) corresponds to the top-left quarter of panel (a).
  • Figure 4: Cyclic shift mechanism in Swin Transformer. Each layer computes attention in fixed local windows. The next layer cyclically shifts the patches toward the top left so that boundary tokens meet new neighbors.
  • Figure 5: Attention sparsity across all three stages of a Swin Transformer. Cells with different shades belong to different stages. Where cells overlap, only the lightest (i.e., finest) shade is shown. Each layer attends only within a fixed window, so global information must percolate through successive layers with alternating shifts, unlike the single-layer global reach of FMA2D (Fig. 3).
  • ...and 4 more figures