
Simple yet Effective: Low-Rank Spatial Attention for Neural Operators

Zherui Yang, Haiyang Xin, Tao Du, Ligang Liu

Abstract

Neural operators have emerged as data-driven surrogates for solving partial differential equations (PDEs), and their success hinges on efficiently modeling the long-range, global coupling among spatial points induced by the underlying physics. In many PDE regimes, the induced global interaction kernels are empirically compressible, exhibiting rapid spectral decay that admits low-rank approximations. We leverage this observation to unify representative global mixing modules in neural operators under a shared low-rank template: compressing high-dimensional pointwise features into a compact latent space, processing global interactions within it, and reconstructing the global context back to spatial points. Guided by this view, we introduce Low-Rank Spatial Attention (LRSA) as a clean and direct instantiation of this template. Crucially, unlike prior approaches that often rely on non-standard aggregation or normalization modules, LRSA is built purely from standard Transformer primitives, i.e., attention, normalization, and feed-forward networks, yielding a concise block that is straightforward to implement and directly compatible with hardware-optimized kernels. In our experiments, such a simple construction is sufficient to achieve high accuracy, yielding an average error reduction of over 17% relative to second-best methods, while remaining stable and efficient in mixed-precision training.
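
The abstract's compress, process, and reconstruct template maps directly onto standard attention modules. As a purely illustrative PyTorch sketch (not the authors' implementation; the class name, latent size, and hyperparameters are our assumptions), one way to assemble such a block from off-the-shelf primitives is: a set of learnable latent tokens cross-attends to the spatial features (compress), a self-attention step mixes the latents (process), and the spatial points cross-attend back to the latents (reconstruct), followed by a feed-forward network.

    # Minimal, hypothetical sketch of a compress/process/reconstruct block built
    # only from standard Transformer primitives; not the authors' LRSA code.
    import torch
    import torch.nn as nn

    class LowRankSpatialAttentionSketch(nn.Module):
        def __init__(self, dim=128, num_latents=64, num_heads=4):
            super().__init__()
            # num_latents plays the role of the latent size M ablated in Figure 5.
            self.latents = nn.Parameter(0.02 * torch.randn(num_latents, dim))
            self.compress = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.process = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.reconstruct = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_x = nn.LayerNorm(dim)
            self.norm_z = nn.LayerNorm(dim)
            self.norm_ffn = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, x):                     # x: (batch, N points, dim)
            xn = self.norm_x(x)
            z = self.latents.expand(x.size(0), -1, -1)
            # Compress: latent queries attend over all N points, cost O(N * M).
            z, _ = self.compress(z, xn, xn)
            # Process: global mixing inside the small latent space, cost O(M^2).
            zn = self.norm_z(z)
            z = z + self.process(zn, zn, zn)[0]
            # Reconstruct: each point queries the latent summary, cost O(N * M).
            ctx, _ = self.reconstruct(xn, z, z)
            x = x + ctx
            return x + self.ffn(self.norm_ffn(x))

Because global interactions are routed through the small latent bottleneck, the quadratic cost of full spatial attention over all N points is avoided, which is consistent with the low-rank view sketched in Figure 1.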


Paper Structure

This paper contains 63 sections, 37 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Compressibility of PDE interactions and the unified low-rank paradigm. Left: (a) the original dense kernel, i.e., the Green's function of the 1D Poisson problem; (b--d) its underlying low-rank structure: the reconstructed kernel, fast spectral decay, and approximation error, derived from the numerical factorization illustrated in the middle. Middle: numerical low-rank approximation of global interactions via $K_r \approx U_r \Sigma_r V_r^\top$ (a minimal numerical sketch of this factorization follows the figure list). Right: diverse neural-operator global-mixing modules unified as a learnable compress--process--reconstruct template.
  • Figure 2: Overview of the neural operator backbone and the Low-Rank Spatial Attention (LRSA) block. LRSA routes global information through a compact latent bottleneck using only standard Transformer primitives.
  • Figure 3: Qualitative performance comparison across diverse discretizations. From top-left to bottom-right: Navier-Stokes (regular grid), Elasticity (point cloud), Airfoil and Plasticity (structured grid). Error maps are visualized on the same scale for each task. LRSA yields lower relative errors and preserves sharper physical patterns in high-frequency regions compared to Transolver.
  • Figure 4: Training stability and efficiency. Left: relative $L_2$ error under FP32/BF16/FP16; $\boldsymbol{\times}$ denotes divergence. Right: per-step training latency (forward+backward) and peak training memory on three representative tasks, both reported as ratios relative to Transolver-FP32. Memory Saving is the ratio of the evaluated model's peak training memory to that of the Transolver-FP32 baseline; a lower factor indicates better memory efficiency.
  • Figure 5: Rank and component ablations. Top: sensitivity to latent size $M$. Bottom: component variants of LRSA (Full, w/o latent self-attention, and enforcing symmetric compression and reconstruction).
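
The factorization $K_r \approx U_r \Sigma_r V_r^\top$ referenced in Figure 1 can be reproduced numerically in a few lines of NumPy. The sketch below is our own illustration, not the paper's code: it forms the dense Green's function kernel of the 1D Poisson problem $-u'' = f$ with zero Dirichlet boundary conditions, truncates its SVD at several ranks $r$, and reports the relative reconstruction error; the grid size and chosen ranks are arbitrary.

    # Hypothetical illustration of the low-rank compressibility shown in Figure 1;
    # not taken from the paper's code.
    import numpy as np

    n = 256
    x = np.linspace(0.0, 1.0, n)
    X, Y = np.meshgrid(x, x, indexing="ij")
    # Green's function of -u'' = f on [0, 1] with u(0) = u(1) = 0:
    # G(x, y) = x (1 - y) for x <= y, and y (1 - x) otherwise.
    K = np.where(X <= Y, X * (1.0 - Y), Y * (1.0 - X))

    U, S, Vt = np.linalg.svd(K)
    for r in (4, 8, 16):
        K_r = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]   # rank-r reconstruction
        rel_err = np.linalg.norm(K - K_r) / np.linalg.norm(K)
        print(f"rank {r:2d}: relative Frobenius error {rel_err:.2e}")

    # Rapid singular-value decay is what makes this kernel compressible.
    print("sigma_k / sigma_1 for k = 1..8:", np.round(S[:8] / S[0], 4))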