Table of Contents
Fetching ...

Dual Natural Gradient Descent for Scalable Training of Physics-Informed Neural Networks

Anas Jnini, Flavio Vella

TL;DR

This work introduces Dual Natural Gradient Descent (D-NGD) for Physics-Informed Neural Networks, reframing the Gauss–Newton step from parameter space to a much smaller residual space of dimension $m=\sum_\gamma N_\gamma d_\gamma$. It adds a geodesic-acceleration correction and provides both a dense direct solver and a Hessian-free, Nyström-preconditioned iterative solver, enabling scalable second-order optimization for PINNs with millions of parameters on a single GPU. Across a broad set of high-dimensional PDE benchmarks, D-NGD achieves state-of-the-art accuracy, often improving final $L^2$ errors by one to three orders of magnitude over first-order and quasi-Newton baselines. The approach makes curvature-aware PINN training practical at scale and opens avenues for efficient operator learning and large-scale physics-informed modeling.

Abstract

Natural-gradient methods markedly accelerate the training of Physics-Informed Neural Networks (PINNs), yet their Gauss--Newton update must be solved in the parameter space, incurring a prohibitive $O(n^3)$ time complexity, where $n$ is the number of network trainable weights. We show that exactly the same step can instead be formulated in a generally smaller residual space of size $m = \sum_γ N_γ d_γ$, where each residual class $γ$ (e.g. PDE interior, boundary, initial data) contributes $N_γ$ collocation points of output dimension $d_γ$. Building on this insight, we introduce \textit{Dual Natural Gradient Descent} (D-NGD). D-NGD computes the Gauss--Newton step in residual space, augments it with a geodesic-acceleration correction at negligible extra cost, and provides both a dense direct solver for modest $m$ and a Nystrom-preconditioned conjugate-gradient solver for larger $m$. Experimentally, D-NGD scales second-order PINN optimization to networks with up to 12.8 million parameters, delivers one- to three-order-of-magnitude lower final error $L^2$ than first-order methods (Adam, SGD) and quasi-Newton methods, and -- crucially -- enables natural-gradient training of PINNs at this scale on a single GPU.

Dual Natural Gradient Descent for Scalable Training of Physics-Informed Neural Networks

TL;DR

This work introduces Dual Natural Gradient Descent (D-NGD) for Physics-Informed Neural Networks, reframing the Gauss–Newton step from parameter space to a much smaller residual space of dimension . It adds a geodesic-acceleration correction and provides both a dense direct solver and a Hessian-free, Nyström-preconditioned iterative solver, enabling scalable second-order optimization for PINNs with millions of parameters on a single GPU. Across a broad set of high-dimensional PDE benchmarks, D-NGD achieves state-of-the-art accuracy, often improving final errors by one to three orders of magnitude over first-order and quasi-Newton baselines. The approach makes curvature-aware PINN training practical at scale and opens avenues for efficient operator learning and large-scale physics-informed modeling.

Abstract

Natural-gradient methods markedly accelerate the training of Physics-Informed Neural Networks (PINNs), yet their Gauss--Newton update must be solved in the parameter space, incurring a prohibitive time complexity, where is the number of network trainable weights. We show that exactly the same step can instead be formulated in a generally smaller residual space of size , where each residual class (e.g. PDE interior, boundary, initial data) contributes collocation points of output dimension . Building on this insight, we introduce \textit{Dual Natural Gradient Descent} (D-NGD). D-NGD computes the Gauss--Newton step in residual space, augments it with a geodesic-acceleration correction at negligible extra cost, and provides both a dense direct solver for modest and a Nystrom-preconditioned conjugate-gradient solver for larger . Experimentally, D-NGD scales second-order PINN optimization to networks with up to 12.8 million parameters, delivers one- to three-order-of-magnitude lower final error than first-order methods (Adam, SGD) and quasi-Newton methods, and -- crucially -- enables natural-gradient training of PINNs at this scale on a single GPU.

Paper Structure

This paper contains 61 sections, 3 theorems, 53 equations, 5 figures, 12 tables, 6 algorithms.

Key Result

Proposition 3.1

For any parameter vector $\theta_k\in\mathbb{R}^n$, define the primal Gramian The unique minimiser $\Delta\theta_k^\star\in\mathbb{R}^n$ of the damped least-squares subproblem then satisfies Here, $\nabla_{\theta}L(\theta_k)=J(\theta_k)^\top r(\theta_k)$ is the gradient of the unweighted least-squares loss $\tfrac{1}{2}\|r(\theta_k)\|_2^2$.

Figures (5)

  • Figure 1: Left: Heat equation in $10{+}1$ d; Right: Logarithmic Fokker–Planck in $9{+}1$ d. We plot the median relative $L^2$ error across all runs, with shaded bands indicating the interquartile range (25th–75th percentiles) for all solvers.
  • Figure 2: Left: Kovásznay flow at $Re=40$; Right: Allen-Cahn reaction-diffusion. We plot the median relative $L^2$ error across all runs, with shaded bands indicating the interquartile range (25th–75th percentiles) for all solvers.
  • Figure 3: Left: Lid-driven cavity at $Re=3000$; Right: Poisson Equation in 10 Dimensions (with PCGD-NGD). We plot the median relative $L^2$ error across all runs, with shaded bands indicating the interquartile range (25th–75th percentiles) for all solvers.
  • Figure 4: Poisson Equation in 100,000 Dimensions We plot the median relative $L^2$ error across all runs, with shaded bands indicating the interquartile range (25th–75th percentiles) for all solvers.
  • Figure 5: Evolution of the first 5000 eigenvalues of the Gramian matrix at different training iterations for the Lid-driven cavity problem. The plots show that the spectrum decays rapidly in the earlier stages of training.

Theorems & Definitions (7)

  • Proposition 3.1: Primal normal equations
  • Definition 3.1: Residual Gramian
  • Theorem 3.1: Dual Normal Equations
  • Definition 3.2: Geodesic-Acceleration Subproblem
  • Proposition 3.2: Primal and Dual GA Characterizations
  • Remark 3.1: Implementation Cost of GA
  • Definition 3.3: Kernel–Vector Product