Transolver is a Linear Transformer: Revisiting Physics-Attention through the Lens of Linear Attention

Wenjie Hu, Sidun Liu, Peng Qiao, Zhenglun Sun, Yong Dou

TL;DR

The paper reframes Transolver’s Physics-Attention as a special case of linear attention and shows that interactions between slices are not essential for performance. A two-step transformation, first generalizing to asymmetric Q/K projections and then removing slice attention, yields LinearNO, a canonical linear-attention neural operator. LinearNO achieves state-of-the-art accuracy on six standard PDE benchmarks and superior performance on two industrial datasets (AirfRANS and Shape-Net Car), while reducing parameters by about 40% and FLOPs by about 36% on average. It also demonstrates strong out-of-distribution generalization and discretization invariance, supported by a theoretical result on convergence to a continuous integral operator and by extensive ablations. Overall, LinearNO is a more efficient and scalable design for data-driven PDE solvers.

Abstract

Recent advances in Transformer-based Neural Operators have enabled significant progress in data-driven solvers for Partial Differential Equations (PDEs). Most current research has focused on reducing the quadratic complexity of attention to address the resulting low training and inference efficiency. Among these works, Transolver stands out as a representative method that introduces Physics-Attention to reduce computational costs. Physics-Attention projects grid points into slices for slice attention, then maps them back through deslicing. However, we observe that Physics-Attention can be reformulated as a special case of linear attention, and that the slice attention may even hurt the model performance. Based on these observations, we argue that its effectiveness primarily arises from the slice and deslice operations rather than interactions between slices. Building on this insight, we propose a two-step transformation to redesign Physics-Attention into a canonical linear attention, which we call Linear Attention Neural Operator (LinearNO). Our method achieves state-of-the-art performance on six standard PDE benchmarks, while reducing the number of parameters by an average of 40.0% and computational cost by 36.2%. Additionally, it delivers superior performance on two challenging, industrial-level datasets: AirfRANS and Shape-Net Car.
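The slice, slice-attention, and deslice pipeline described in the abstract can be sketched in a few lines. This is a minimal, single-head illustration and not the authors' implementation: the names `physics_attention`, `explicit_attention_form`, and `w_slice` are hypothetical, the slice attention uses identity Q/K/V projections for brevity, and plain Python lists are used to keep the block dependency-free. Under these assumptions, switching off slice attention leaves an output of the form φ(Q)(ψ(K)ᵀV), which is the linear-attention reading the abstract refers to.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(a):
    """Numerically stabilized softmax along each row."""
    out = []
    for row in a:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

def sum_normalize_columns(w):
    """Divide each column of an N x M matrix by its sum; returns the M x N transpose."""
    normalized = []
    for col in zip(*w):
        s = sum(col)
        normalized.append([v / s for v in col])
    return normalized

def physics_attention(x, w_slice, use_slice_attention=True):
    """x: N x C point features; w_slice: C x M (hypothetical slice projection)."""
    w = softmax_rows(matmul(x, w_slice))         # N x M slice weights (softmax over M)
    z = matmul(sum_normalize_columns(w), x)      # slice: M x C slice tokens
    if use_slice_attention:
        # Softmax self-attention among the M slice tokens (identity projections, assumed).
        scale = math.sqrt(len(x[0]))
        z_t = [list(col) for col in zip(*z)]
        scores = [[v / scale for v in row] for row in matmul(z, z_t)]
        z = matmul(softmax_rows(scores), z)
    return matmul(w, z)                          # deslice: back to N x C

def explicit_attention_form(x, w_slice):
    """Same map without slice attention, written with an explicit N x N attention matrix."""
    phi_q = softmax_rows(matmul(x, w_slice))     # phi(Q): N x M
    psi_k_t = sum_normalize_columns(phi_q)       # psi(K)^T: M x N
    attn = matmul(phi_q, psi_k_t)                # N x N, rank at most M
    return attn, matmul(attn, x)                 # V = x in this sketch

random.seed(0)
n, c, m = 6, 4, 3
x = [[random.gauss(0, 1) for _ in range(c)] for _ in range(n)]
w_slice = [[random.gauss(0, 1) for _ in range(m)] for _ in range(c)]

out_no_slice_attn = physics_attention(x, w_slice, use_slice_attention=False)
attn, out_explicit = explicit_attention_form(x, w_slice)
out_full = physics_attention(x, w_slice)
```

The two outputs without slice attention agree up to floating-point error because matrix multiplication is associative. The first function never forms the N x N matrix, so its cost grows linearly in the number of points N.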

Paper Structure

This paper contains 38 sections, 1 theorem, 38 equations, 10 figures, 14 tables.

Key Result

Theorem 1

Let $\{\bm{x}_i\}_{i=1}^{+\infty}$ be a sequence of refined meshes on $\Omega$ with $\bm{x}_i \thicksim \mu_{\Omega}$, and assume that the function $v(\bm{x})$ is bounded on $\Omega$. As $n \to +\infty$, for any $\epsilon > 0$, the proposed LinearNO converges in probability to a continuous integral operator. Here, $\bm{R}$ are learnable parameters, $\mathcal{F}(\bm{x})$ is the continuous integral kernel op

Figures (10)

  • Figure 1: We observe that in most scenarios, removing the slice attention in Physics-Attention leads to performance improvement. This suggests that its effectiveness mainly stems from the slice and deslice operations, rather than the interactions between slices.
  • Figure 2: (a) The overall design of our network. The encoder and decoder modules follow the same architectural design as those in Transolver. (b) Comparison between Physics-Attention and LinearNO. The top shows the Physics-Attention in Transolver, while the bottom depicts our LinearNO. Softmax@M and Softmax@N indicate softmax operations along dimensions $M$ and $N$, respectively. Sum-norm@N refers to the standard normalization $x_i'=\frac{x_i}{\sum_{j=1}^N x_j}$ for each row.
  • Figure 3: Visualization of Slice-Weight Matrix $W_N$
  • Figure 4: The left plot shows the visualization of the final-layer $\varphi(\bm{Q})$ generated by LinearNO and Transolver. The middle plot visualizes the error distributions of LinearNO and Transolver. The right plot illustrates two metrics: the average rank of the attention matrix $\varphi(\bm{Q})\psi^\top(\bm{K})$ per head at each layer when Slice $M = 64$, and the prediction errors of both models on the Airfoil dataset under different slice numbers.
  • Figure 5: Overall Architecture with Cross-LinearNO.
  • ...and 5 more figures
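The three normalizations named in the Figure 2 caption can be written out explicitly. The sketch below assumes an N x M layout (N points as rows, M slices as columns), so Softmax@M acts along rows and Softmax@N and Sum-norm@N act along columns. The function names are hypothetical. Pairing Softmax@M on the query side with Softmax@N on the key side is one plausible reading of the "asymmetric Q/K projections" described in the TL;DR, not a confirmed detail of the authors' code.

```python
import math

def softmax(values):
    """Numerically stabilized softmax of a flat list."""
    m = max(values)
    e = [math.exp(v - m) for v in values]
    s = sum(e)
    return [v / s for v in e]

def softmax_at_m(a):
    """Softmax along the slice dimension M (each row of an N x M matrix)."""
    return [softmax(row) for row in a]

def softmax_at_n(a):
    """Softmax along the point dimension N (each column of an N x M matrix)."""
    cols = [softmax(list(col)) for col in zip(*a)]
    return [list(row) for row in zip(*cols)]

def sum_norm_at_n(a):
    """Sum-norm along N: x_i' = x_i / sum_j x_j, applied per column."""
    col_sums = [sum(col) for col in zip(*a)]
    return [[v / s for v, s in zip(row, col_sums)] for row in a]

def linear_attention(phi_q, psi_k, v):
    """Compute phi_q @ (psi_k^T @ v) without forming the N x N attention matrix."""
    n, m, c = len(v), len(psi_k[0]), len(v[0])
    kv = [[sum(psi_k[i][j] * v[i][d] for i in range(n)) for d in range(c)]
          for j in range(m)]                       # M x C summary
    return [[sum(phi_q[i][j] * kv[j][d] for j in range(m)) for d in range(c)]
            for i in range(n)]                     # N x C output

# Illustrative logits for N = 4 points and M = 2 slices (made-up numbers).
q_logits = [[0.2, -1.0], [1.5, 0.3], [-0.7, 0.8], [0.0, 0.0]]
k_logits = [[1.0, 0.5], [-0.3, 0.9], [0.4, -1.2], [0.1, 0.6]]
v = [[2.0, -1.0] for _ in range(4)]                # constant field

# Physics-Attention-style weights: one shared projection, Softmax@M then Sum-norm@N.
phi_shared = softmax_at_m(q_logits)
psi_shared = sum_norm_at_n(phi_shared)

# Asymmetric variant (assumed reading): separate Q and K logits.
phi_q = softmax_at_m(q_logits)
psi_k = softmax_at_n(k_logits)

out_shared = linear_attention(phi_shared, psi_shared, v)
out_asym = linear_attention(phi_q, psi_k, v)
```

In both variants the rows of φ(Q) and the columns of ψ(K) sum to one, so the implied attention matrix φ(Q)ψᵀ(K) has rows that sum to one and a constant input field is reproduced unchanged.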

Theorems & Definitions (2)

  • Theorem 1
  • proof