Continuum Attention for Neural Operators
Edoardo Calvello, Nikola B. Kovachki, Matthew E. Levine, Andrew M. Stuart
TL;DR
The paper develops a continuum formulation of attention that acts on spaces of functions, enabling discretization-invariant neural operators for learning mappings between function spaces. It introduces transformer-based neural operators, including the vanilla TNO, ViTNO, and FANO, and proves a universal approximation theorem for transformer neural operators under a slight modification of the architecture used in practice. The patching strategy from computer vision is generalized to function space to obtain efficient, mesh-invariant architectures. Numerical experiments on Lorenz 63, Darcy flow, and Kolmogorov flow (Navier-Stokes) demonstrate strong accuracy, zero-shot generalization across discretizations, and favorable parameter efficiency. The framework unifies attention theory with operator learning and offers scalable approaches for solving parametric PDEs and data assimilation problems.
Abstract
Transformers, and the attention mechanism in particular, have become ubiquitous in machine learning. Their success in modeling nonlocal, long-range correlations has led to their widespread adoption in natural language processing, computer vision, and time series problems. Neural operators, which map spaces of functions into spaces of functions, are necessarily both nonlinear and nonlocal if they are universal; it is thus natural to ask whether the attention mechanism can be used in the design of neural operators. Motivated by this, we study transformers in the function space setting. We formulate attention as a map between infinite dimensional function spaces and prove that the attention mechanism as implemented in practice is a Monte Carlo or finite difference approximation of this operator. The function space formulation allows for the design of transformer neural operators, a class of architectures designed to learn mappings between function spaces. In this paper, we state and prove the first universal approximation result for transformer neural operators, using only a slight modification of the architecture implemented in practice. The prohibitive cost of applying the attention operator to functions defined on multi-dimensional domains leads to the need for more efficient attention-based architectures. For this reason we also introduce a function space generalization of the patching strategy from computer vision, and introduce a class of associated neural operators. Numerical results, on an array of operator learning problems, demonstrate the promise of our approaches to function space formulations of attention and their use in neural operators.
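The claim that practical attention approximates a function space operator can be illustrated with a minimal NumPy sketch. It assumes (the abstract does not spell this out) that the continuum operator has the softmax-normalized integral form A(u)(x) = ∫ exp(<Q u(x), K u(y)>/sqrt(d)) V u(y) dy / ∫ exp(<Q u(x), K u(s)>/sqrt(d)) ds on the domain [0, 1]. On a uniform grid the quadrature weights cancel between numerator and denominator, so ordinary softmax self-attention is a quadrature approximation of this ratio. The test function u, the matrices Q, K, V, and the grid sizes are hypothetical choices made for illustration, not the paper's experimental setup.

```python
import numpy as np

def softmax_self_attention(U, Q, K, V):
    """Standard softmax self-attention on samples U of shape (N, d_u)."""
    q, k, v = U @ Q.T, U @ K.T, U @ V.T
    scores = q @ k.T / np.sqrt(q.shape[1])
    # Subtract the row-wise max for numerical stability (does not change the result).
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    # Row normalization: the uniform quadrature weight 1/N cancels here.
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v

def sample_u(n):
    """Sample the (hypothetical) function u(x) = (sin 2*pi*x, cos 2*pi*x, x)
    on the n midpoints of a uniform partition of [0, 1]."""
    x = (np.arange(n) + 0.5) / n
    return np.stack([np.sin(2 * np.pi * x), np.cos(2 * np.pi * x), x], axis=1)

# Hypothetical learned matrices, fixed by a seed for reproducibility.
rng = np.random.default_rng(0)
Q = 0.5 * rng.standard_normal((4, 3))
K = 0.5 * rng.standard_normal((4, 3))
V = rng.standard_normal((4, 3))

# A fine grid stands in for the continuum operator.
n_ref = 729
reference = softmax_self_attention(sample_u(n_ref), Q, K, V)

def error_vs_reference(n):
    """Max deviation between attention on an n-point grid and the reference,
    compared at grid points the two midpoint grids share (n_ref / n must be odd)."""
    r = n_ref // n
    shared = r * np.arange(n) + (r - 1) // 2
    out = softmax_self_attention(sample_u(n), Q, K, V)
    return np.max(np.abs(out - reference[shared]))

err_coarse = error_vs_reference(27)
err_fine = error_vs_reference(81)
print(f"max error, n=27: {err_coarse:.2e}")
print(f"max error, n=81: {err_fine:.2e}")
```

The same matrices Q, K, V are applied at every resolution, and the outputs at shared points approach the fine-grid reference as the grid is refined. This is the sense in which an attention layer can be read as a discretization of a single operator on functions rather than a map tied to one sequence length.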
