Fast Multipole Attention: A Scalable Multilevel Attention Mechanism for Text and Images
Yanming Kang, Giang Tran, Hans De Sterck
TL;DR
The paper tackles the quadratic bottleneck of self-attention by introducing Fast Multipole Attention (FMA), a multilevel, physics-inspired mechanism that combines exact near-field interactions with learned far-field aggregations. By organizing attention in a hierarchical 1D (and 2D) structure, FMA achieves O(n log n) complexity (or O(n) with query downsampling) while preserving global context for long sequences and high-resolution images. Empirically, FMA matches or outperforms strong efficient-attention baselines on autoregressive and bidirectional language benchmarks and surpasses strong vision transformer baselines on ImageNet-1K and ADE20K while using comparable or lower memory. The approach is implemented with GPU-optimized kernels and opens avenues for 3D extensions and hybrid architectures.
Abstract
While Transformer networks benefit from a global receptive field, their quadratic cost relative to sequence length restricts their application to long sequences and high-resolution inputs. We introduce Fast Multipole Attention (FMA), a divide-and-conquer mechanism for self-attention inspired by the Fast Multipole Method from n-body physics. FMA reduces the time and memory complexity of self-attention from $\mathcal{O}\left(n^2\right)$ to $\mathcal{O}(n \log n)$, or to $\mathcal{O}(n)$ when the queries are also downsampled, while preserving full-context interactions. FMA contains a learned hierarchy with $\mathcal{O}(\log n)$ levels of resolution. In this hierarchy, nearby tokens interact at full resolution, while distant tokens interact through progressively coarser, learned basis functions. We have developed both 1D and 2D implementations of FMA for language and vision tasks, respectively. On autoregressive and bidirectional language modeling benchmarks, the 1D variant either matches or outperforms leading efficient attention baselines with substantially lower memory use. With linear complexity, the 2D variant demonstrates superior performance over strong vision transformer baselines in classification and semantic segmentation tasks. Our results confirm that the multilevel attention implemented by FMA allows Transformer-based models to scale to much longer sequences and higher-resolution inputs without loss in accuracy. This provides a principled, physics-inspired approach for developing scalable neural networks suitable for language, vision, and multimodal tasks. Our code will be available at https://github.com/epoch98/FMA.
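To make the near-field/far-field hierarchy described above concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a single head, bidirectional attention, a fine block size `m`, and plain mean pooling as a stand-in for the learned basis functions. The function name `toy_multilevel_attention` and its parameters are hypothetical. Each pooled summary takes one softmax slot, which is a simplification chosen for brevity.

```python
import numpy as np

def toy_multilevel_attention(q, k, v, m=4):
    """Toy bidirectional multilevel attention for one head (illustrative only).

    q, k: arrays of shape (n, d); v: array of shape (n, dv).
    n must equal m * 2**j for some integer j >= 0.
    Returns (output of shape (n, dv), number of score slots used per query).
    """
    n, d = q.shape
    assert n % m == 0 and ((n // m) & (n // m - 1)) == 0, "n must be m * 2**j"
    out = np.zeros((n, v.shape[1]))
    slots = np.zeros(n, dtype=int)

    # Mean-pooled keys/values for block sizes m, 2m, 4m, ... (assumed stand-in
    # for the learned basis functions mentioned in the abstract).
    pooled = {}
    g = m
    while g < n:
        pooled[g] = (k.reshape(n // g, g, -1).mean(axis=1),
                     v.reshape(n // g, g, -1).mean(axis=1))
        g *= 2

    for i in range(n):
        # Near field: individual tokens in the query's block and its neighbours.
        b = i // m
        lo, hi = max(b - 1, 0) * m, min(b + 2, n // m) * m  # covered range [lo, hi)
        keys, vals = [k[lo:hi]], [v[lo:hi]]
        g = m
        while hi - lo < n:
            # Widen coverage to the 3-block neighbourhood at block size 2g.
            B = i // (2 * g)
            new_lo = max(B - 1, 0) * 2 * g
            new_hi = min(B + 2, n // (2 * g)) * 2 * g
            # Newly covered tokens are seen only through size-g summaries.
            idx = np.concatenate([np.arange(new_lo // g, lo // g),
                                  np.arange(hi // g, new_hi // g)])
            pk, pv = pooled[g]
            keys.append(pk[idx])
            vals.append(pv[idx])
            lo, hi, g = new_lo, new_hi, 2 * g
        K, V = np.concatenate(keys), np.concatenate(vals)
        s = K @ q[i] / np.sqrt(d)
        w = np.exp(s - s.max())          # numerically stable softmax
        w /= w.sum()
        out[i] = w @ V
        slots[i] = len(s)
    return out, slots

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 8)) for _ in range(3))
out, slots = toy_multilevel_attention(q, k, v, m=4)
print(out.shape, slots.max(), "slots per query at most, versus", len(q), "for dense attention")
```

In this sketch every query scores at most 3m individual tokens plus at most three summaries per coarser level, so the total work grows roughly as n (m + log(n/m)), in line with the $\mathcal{O}(n \log n)$ scaling stated in the abstract. The explicit Python loop is for readability only and says nothing about how the GPU kernels are organized.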
