FLARE: Fast Low-rank Attention Routing Engine
Vedant Puri, Aditya Joglekar, Kevin Ferguson, Yu-hsuan Chen, Yongjie Jessica Zhang, Levent Burak Kara
TL;DR
FLARE addresses the quadratic cost of self-attention on large unstructured meshes by routing each head's attention through a fixed-length latent sequence of $M \ll N$ tokens, which reduces the cost to $O(NM)$, linear in the number of points $N$, and enables scalable PDE surrogate modeling. It pairs an encoding cross-attention with a decoding cross-attention, both driven by a fixed number of learnable latent queries. The resulting communication pattern is low-rank, and expressivity is preserved through head-wise independent projections and deep residual MLPs for the key/value projections. Spectral analysis confirms a low-rank, diverse, head-specific attention structure, and experiments show FLARE achieving state-of-the-art accuracy across diverse PDE benchmarks while scaling to one million points on a single GPU. The work also releases a large laser powder bed fusion (LPBF) additive manufacturing dataset to spur further research and provides open-source code that integrates with standard fused attention kernels.
Abstract
The quadratic complexity of self-attention limits its applicability and scalability on large unstructured meshes. We introduce Fast Low-rank Attention Routing Engine (FLARE), a linear-complexity self-attention mechanism that routes attention through fixed-length latent sequences. Each attention head performs global communication among $N$ tokens by projecting the input sequence onto a fixed-length latent sequence of $M \ll N$ tokens using learnable query tokens. By routing attention through a bottleneck sequence, FLARE learns a low-rank form of attention that can be applied at $O(NM)$ cost. FLARE not only scales to unprecedented problem sizes, but also delivers superior accuracy compared to state-of-the-art neural PDE surrogates across diverse benchmarks. We also release a new additive manufacturing dataset to spur further research. Our code is available at https://github.com/vpuri3/FLARE.py.
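The routing pattern described above can be illustrated with a minimal NumPy sketch. It is not the released implementation. It assumes that each head holds $M$ learnable latent queries, that the encoding step lets those queries attend over the $N$ token keys, and that the decoding step lets the tokens attend back over the latents using the same queries as keys. The function name `latent_routed_attention`, the reuse of the latent queries in both steps, and the $1/\sqrt{D}$ scaling are illustrative assumptions, and the key/value projections are taken as given inputs.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def latent_routed_attention(q_latent, k, v):
    """Route attention among N tokens through M latent tokens per head.

    q_latent: (H, M, D) latent queries (learnable in a real model)
    k, v:     (H, N, D) per-head key and value projections of the input
    returns:  (H, N, D) mixed token features

    No (N, N) matrix is ever formed, so time and memory scale as O(N * M).
    """
    d = q_latent.shape[-1]
    scale = 1.0 / np.sqrt(d)  # assumed scaling, as in standard attention

    # Encode: each of the M latent queries attends over all N tokens.
    w_enc = softmax(scale * q_latent @ k.transpose(0, 2, 1), axis=-1)  # (H, M, N)
    z = w_enc @ v  # (H, M, D) latent sequence

    # Decode: each of the N tokens attends over the M latents.
    w_dec = softmax(scale * k @ q_latent.transpose(0, 2, 1), axis=-1)  # (H, N, M)
    return w_dec @ z  # (H, N, D)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, N, M, D = 4, 10_000, 16, 8  # heads, tokens, latents, head dimension
    q_latent = rng.normal(size=(H, M, D))
    k = rng.normal(size=(H, N, D))
    v = rng.normal(size=(H, N, D))
    y = latent_routed_attention(q_latent, k, v)
    print(y.shape)
```

Under these assumptions, the implied token-to-token mixing matrix for each head is the product of an $N \times M$ and an $M \times N$ row-stochastic matrix, so its rank is at most $M$. Each of the two steps has the form of ordinary scaled dot-product attention, which is why this pattern can be mapped onto standard fused attention kernels.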
