Learning, Solving and Optimizing PDEs with TensorGalerkin: an efficient high-performance Galerkin assembly algorithm

Shizheng Wen; Mingyuan Chi; Tianwei Yu; Ben Moseley; Mike Yan Michelis; Pu Ren; Hao Sun; Siddhartha Mishra

TL;DR

TensorGalerkin delivers a unified, high-performance framework for solving, learning, and optimizing PDEs with a variational structure by reformulating Galerkin assembly as a two-stage Map–Reduce pipeline that tensorizes local contractions and uses deterministic sparse projections via SpMM to assemble $K$ and $F$. This enables GPU-accelerated solvers (TensorMesh), physics-informed operator learning (TensorPils) that leverages analytical shape gradients, and end-to-end differentiable PDE-constrained optimization (TensorOpt) within PyTorch. Across 2D/3D elliptic, parabolic, and hyperbolic PDE benchmarks, the approach yields substantial speedups with maintained or improved accuracy compared to strong baselines (FEniCS, SKFEM, JAX-FEM, PINNs, PI-DeepONet). By enabling efficient many-query PDE workflows, TensorGalerkin provides a practical, scalable foundation for physics-informed learning and design optimization on unstructured meshes.

Abstract

We present a unified algorithmic framework for the numerical solution, constrained optimization, and physics-informed learning of PDEs with a variational structure. The framework is based on a Galerkin discretization of the underlying variational forms, and its efficiency stems from TensorGalerkin, a novel, highly optimized, GPU-compliant algorithm for assembling the linear system (stiffness matrices and load vectors). TensorGalerkin tensorizes element-wise operations within a Python-level Map stage and then performs the global reduction with a sparse matrix multiplication that carries out message passing on the mesh-induced sparsity graph. It can be employed downstream as i) a highly efficient numerical PDE solver, ii) an end-to-end differentiable framework for PDE-constrained optimization, and iii) a physics-informed operator learning algorithm for PDEs. On multiple benchmarks, including 2D and 3D elliptic, parabolic, and hyperbolic PDEs on unstructured meshes, we demonstrate that the proposed framework provides significant gains in computational efficiency and accuracy over a variety of baselines in all the targeted downstream applications.
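The two-stage assembly described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes NumPy instead of PyTorch, a 1D Poisson problem $-u'' = 1$ with linear elements, and a dense one-hot routing matrix standing in for the sparse one that would be applied via SpMM. The function name `assemble_poisson_1d` and all variable names are hypothetical.

```python
import numpy as np

def assemble_poisson_1d(nodes, conn):
    """Map-Reduce assembly of the P1 stiffness matrix and load vector for -u'' = 1."""
    n_nodes, n_elem = nodes.shape[0], conn.shape[0]

    # Stage I (Map): local operators for all elements in one batched contraction.
    h = nodes[conn[:, 1]] - nodes[conn[:, 0]]          # (E,) element lengths
    grad = np.stack([-1.0 / h, 1.0 / h], axis=1)       # (E, 2) shape-function gradients
    K_loc = np.einsum("e,ei,ej->eij", h, grad, grad)   # (E, 2, 2) local stiffness
    F_loc = 0.5 * h[:, None] * np.ones((n_elem, 2))    # (E, 2) local load for f = 1

    # Stage II (Reduce): routing matrices depend only on connectivity,
    # so they can be built once and reused for every assembly.
    rows = np.repeat(conn, 2, axis=1).ravel()          # global row of each local entry
    cols = np.tile(conn, (1, 2)).ravel()               # global column of each local entry
    flat = rows * n_nodes + cols
    nz, inverse = np.unique(flat, return_inverse=True) # unique nonzero positions
    inverse = inverse.ravel()

    # One-hot routing: R_K[p, q] = 1 if local entry q contributes to nonzero p.
    # Dense here for brevity (assumption); a real mesh would need a sparse format.
    R_K = np.zeros((nz.size, flat.size))
    R_K[inverse, np.arange(flat.size)] = 1.0
    R_F = np.zeros((n_nodes, conn.size))
    R_F[conn.ravel(), np.arange(conn.size)] = 1.0

    # A single matrix product replaces the per-element scatter-add.
    K_vals = R_K @ K_loc.reshape(-1)
    F = R_F @ F_loc.reshape(-1)

    # Expand the nonzero values to a dense matrix for inspection only.
    K = np.zeros((n_nodes, n_nodes))
    K[nz // n_nodes, nz % n_nodes] = K_vals
    return K, F

# Usage: 4 uniform elements on [0, 1] with u(0) = u(1) = 0.
nodes = np.linspace(0.0, 1.0, 5)
conn = np.stack([np.arange(4), np.arange(1, 5)], axis=1)
K, F = assemble_poisson_1d(nodes, conn)

interior = np.arange(1, 4)
u = np.zeros(5)
u[interior] = np.linalg.solve(K[np.ix_(interior, interior)], F[interior])
u_exact = 0.5 * nodes * (1.0 - nodes)
```

Because the routing matrices are fixed by the mesh, the reduction is a deterministic linear map of the local values, which is what makes the assembly straightforward to differentiate through in an autodiff framework. In this 1D example the computed nodal values `u` coincide with `u_exact` up to floating-point error.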

Paper Structure (68 sections, 51 equations, 21 figures, 7 tables, 2 algorithms)

Introduction
Methods
Problem Formulation.
Neural Galerkin Discretization.
Numerical PDE Solvers.
Physics-informed Operator Learning and Neural PDE Solvers.
Algorithmic Realization and Bottlenecks.
The TensorGalerkin Framework.
Analysis of the Computational Graph.
Downstream Applications of TensorGalerkin.
Results
Numerical PDE solver.
Neural PDE Solver.
Physics-informed Operator Learning.
PDE Constrained Inverse Design.
...and 53 more sections

Figures (21)

Figure 1: Overview of TensorGalerkin. Stage I (Batch-Map) computes element-wise operators via a fully tensorized einsum kernel; Stage II (Sparse-Reduce) assembles global sparse values via routing matrices and a single SpMM. For comparison, the white box illustrates traditional FEM assembly via per-element loops and scatter-add (atomics) into the global system. The same assembly engine powers TensorMesh, TensorPils, and TensorOpt.
Figure 2: Runtime performance comparison. We report the solve times for (a) the Poisson equation and (b) linear elasticity problems on 3D meshes.
Figure 3: CUDA runtime of one forward loss computation vs. DoF for different training objectives.
Figure B.1: Relative linear-system residual vs. degrees of freedom (DoF) for 3D Poisson and 3D elasticity.
Figure B.2: Poisson3D solution visualization across solvers.
...and 16 more figures
