Automatic Differentiation is Essential in Training Neural Networks for Solving Differential Equations

Chuqi Chen, Yahong Yang, Yang Xiang, Wenrui Hao

TL;DR

This work compares automatic differentiation (AD) and finite difference (FD) differentiation in neural PDE solvers, introducing a truncated-entropy metric $H_A(a)$ to quantify how the singular-value spectrum influences training. Through theoretical analysis and extensive experiments on random-feature models and two-layer neural networks solving Poisson and biharmonic equations, the authors show AD and FD share the same large singular values, but FD retains more small singular values, slowing training. The truncated entropy framework links spectral properties to training speed, predicting that AD enables faster convergence and smaller training residuals, a result borne out particularly in higher-order PDEs and deeper networks. The findings offer a principled lens to diagnose and potentially improve training dynamics for physics-informed neural networks and related PDE solvers in science and engineering.

Abstract

Neural network-based approaches have recently shown significant promise in solving partial differential equations (PDEs) in science and engineering, especially in scenarios featuring complex domains or the incorporation of empirical data. One advantage of neural network methods for PDEs lies in their use of automatic differentiation (AD), which requires only the sample points themselves, unlike traditional finite difference (FD) approximations, which require nearby local points to compute derivatives. In this paper, we quantitatively demonstrate the advantage of AD in training neural networks. The concept of truncated entropy is introduced to characterize training behavior. Specifically, through comprehensive experimental and theoretical analyses of random feature models and two-layer neural networks, we find that truncated entropy serves as a reliable metric for quantifying the residual loss of random feature models and the training speed of neural networks for both AD and FD methods. Our experimental and theoretical analyses demonstrate that, from a training perspective, AD outperforms FD in solving PDEs.
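To make the AD versus FD contrast concrete, the following is a minimal sketch of how the two differentiation matrices could be assembled for a 1D random feature model applied to $u_{xx}=f(x)$. It is not the authors' code. The function name `feature_matrices`, the step size `h`, the uniform sampling of weights and biases, and the omission of boundary rows are all assumptions made for illustration. Because the features are $\sin(w_j x + b_j)$, the exact second derivative is available in closed form and is used here as a stand-in for what AD would return up to rounding.

```python
import numpy as np

def feature_matrices(n_points=50, n_features=50, h=1e-2, seed=0):
    """Return (A_ad, A_fd); rows are sample points, columns are random features.

    Feature j is phi_j(x) = sin(w_j * x + b_j). Its exact second derivative,
    -w_j**2 * sin(w_j * x + b_j), stands in for AD (an assumption of this
    sketch). The FD matrix uses a central second difference with step h.
    Boundary-condition rows are omitted for brevity.
    """
    rng = np.random.default_rng(seed)
    w = rng.uniform(-1.0, 1.0, size=n_features)   # hypothetical weight range
    b = rng.uniform(-1.0, 1.0, size=n_features)   # hypothetical bias range
    x = np.linspace(-1.0, 1.0, n_points)[:, None]  # column of sample points

    def phi(t):
        return np.sin(w * t + b)                   # broadcasts to (n_points, n_features)

    a_ad = -(w ** 2) * phi(x)                                  # exact derivative
    a_fd = (phi(x + h) - 2.0 * phi(x) + phi(x - h)) / h ** 2   # needs neighbouring points
    return a_ad, a_fd

a_ad, a_fd = feature_matrices()
s_ad = np.linalg.svd(a_ad, compute_uv=False)  # singular values, descending
s_fd = np.linalg.svd(a_fd, compute_uv=False)
print("largest singular value  (AD, FD):", s_ad[0], s_fd[0])
print("smallest singular value (AD, FD):", s_ad[-1], s_fd[-1])
```

Comparing the two printed spectra is one way to inspect where the matrices agree and where they diverge. Which part of the spectrum differs, and by how much, will depend on `h`, the feature distribution, and the matrix size.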

Paper Structure

This paper contains 22 sections, 3 theorems, 67 equations, and 14 figures.

Key Result

Proposition 1

Let $\bar{\lambda}$ and $\underline{\lambda}$ denote the largest and smallest eigenvalues of $\bm{E}^\intercal\bm{A}_{\text{FD}}+\bm{A}_{\text{FD}}^\intercal\bm{E}+h^2 \bm{E}^\intercal\bm{E}$, respectively. Then we have that [...], where $\lambda_{\max}(\bm{A})$ denotes the largest eigenvalue of $\bm{A}$.

Figures (14)

  • Figure 1: (a): Truncated entropy. (b): Relative training error $L_{\text{PINN}}$ ($\|\bm{A} \bm{a}-\bm{f}\|/\|\bm{f}\|$). Both are shown for the AD and FD methods with varying numbers of neurons in the RFM for solving $u_{xx}=f(x)$ with Dirichlet boundary conditions. The exact solution is $u(x) = \sin(\pi x)$, where $x\in [-1,1]$. (The number of sample points equals the number of neurons, and the effective cut-off number is $e_A(10^{-12})$.) The corresponding singular values for different $M$ and $N$ are shown in the Appendix (figure label Fig.Appendix2.1).
  • Figure 2: Distribution of singular values of $\bm{A}_{\text{FD}}$ and $\bm{A}_{\text{AD}}$ for solving the Poisson equation using the random feature method with varying dimension $d$, activation function $\sigma(x)$, number of sample points $N$, and number of neurons $M$. (a): $d=1$, $\sigma(x)=\sin(x)$, $M=N=100$. (b): $d=1$, $\sigma(x)=\tanh(x)$, $M=N=300$. (c): $d=2$, $\sigma(x) = \sin(x)$, $M=200$, $N=64\times64$. (d): $d=2$, $\sigma(x) = \sin(x)$, $M=N=16\times16$.
  • Figure 3: (a): The distribution of singular values alongside their respective truncation positions. Green dots mark the truncation positions based on $e_{\bm{A}}(10^{-12})$, with truncated entropy values of $H_{\bm{A}_\text{FD}}(10^{-12}) = 0.1995$ and $H_{\bm{A}_\text{AD}}(10^{-12}) = 0.4183$. (b): Relative training error of $L_\text{PINN}$, denoted as $\frac{\|\bm{A} \bm{a} -\bm{f}\|}{\|\bm{f}\|}$, obtained using the truncated singular value decomposition (SVD) method. The horizontal axis represents the truncation position determined by the effective cut-off number. The dashed lines distinguish the different cases based on the truncation position in both (a) and (b).
  • Figure 4: (a): Distribution of eigenvalues of the kernel matrix $G$, with the red dashed line indicating the approximate convergence position $e_{\bm{G}_k}(10^{-5})$ at the end of training. The truncated entropy values are $H_{\bm{G}_\text{FD}}(10^{-5}) = 0.5785$ and $H_{\bm{G}_\text{AD}}(10^{-5}) = 0.2606$. (b): Training curve of the relative training error $\frac{L_F}{\|f\|}$. (c): Training curve of the loss function $L_{\text{PINN}}$. (Similar performance is observed for $L_{F}$ and $L_{\text{PINN}}$ because the boundary conditions are treated as an identity operator.) More training curves of the training error and the corresponding curves of the L2 relative error are shown in the Appendix (figure label Fig.lcurverelativecurve).
  • Figure 5: Poisson equation in 2D. (a): Distribution of singular values of the random feature matrix $\bm{A}_k$, with the red dashed line indicating the effective cut-off number $e_{\bm{G}_k}(10^{-13})$. The truncated entropy values are $H_{\bm{A}_\text{FD}}(10^{-13}) = 0.3122$ and $H_{\bm{A}_\text{AD}}(10^{-13}) = 0.3342$. (b): Relative training error of $L_\text{PINN}$, denoted as $\frac{\|\bm{A}_k \bm{a} -\bm{f}\|}{\|\bm{f}\|}$, obtained using the truncated SVD method. The horizontal axis represents the truncation position determined by the effective cut-off number. (c): Distribution of eigenvalues of the kernel matrix $G$, with the red dashed line indicating the approximate convergence position $e_{\bm{G}_k}(10^{-4})$ at the end of training. The truncated entropy values are $H_{\bm{G}_\text{FD}}(10^{-4}) = 0.3336$ and $H_{\bm{G}_\text{AD}}(10^{-4}) = 0.4460$. (d): Training curve of the relative training error of the PDE residual, i.e., $\frac{L_F}{\|f\|}$. (The training curve of the total loss $L_{\text{PINN}}$ is shown in Figure A.3.1.)
  • ...and 9 more figures
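
The captions for Figures 3 and 5 report a relative training error $\|\bm{A}\bm{a}-\bm{f}\|/\|\bm{f}\|$ obtained with a truncated SVD at a chosen truncation position. A minimal sketch of that computation is given below. The function name `truncated_svd_residual` and the toy matrix are hypothetical and are not taken from the paper; the toy system only serves to exercise the function.

```python
import numpy as np

def truncated_svd_residual(A, f, k):
    """Relative residual ||A a - f|| / ||f|| when only the k largest
    singular values of A are used to solve the least-squares problem."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = max(1, min(k, len(s)))                     # clamp the truncation position
    # Truncated pseudo-inverse solution: a = V_k diag(1 / s_k) U_k^T f
    a = Vt[:k].T @ ((U[:, :k].T @ f) / s[:k])
    return np.linalg.norm(A @ a - f) / np.linalg.norm(f)

# Hypothetical toy system with a rapidly decaying spectrum.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 40)) @ np.diag(np.logspace(0, -10, 40))
f = A @ rng.standard_normal(40)   # f lies in the range of A

residuals = {k: truncated_svd_residual(A, f, k) for k in (5, 20, 40)}
for k, r in residuals.items():
    print(f"truncation position {k:2d}: relative residual {r:.3e}")
```

In exact arithmetic the residual is non-increasing in the truncation position, since each additional singular triplet removes one more component of $\bm{f}$ from the residual.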

Theorems & Definitions (11)

  • Definition 1: Effective cut-off number
  • Definition 2: Truncated entropy
  • Remark 1
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Remark 2
  • Theorem 1
  • proof
  • ...and 1 more
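
This summary lists Definitions 1 and 2 but does not state them. The sketch below therefore assumes one plausible form: the effective cut-off number counts the singular values whose ratio to the largest one is at least a threshold `eps`, and the truncated entropy is the Shannon entropy of the retained, normalized singular values divided by the logarithm of that count, so that it lies in $[0,1]$. Both formulas, the parameter name `eps`, and the function names are assumptions and may differ from the paper's definitions.

```python
import numpy as np

def effective_cutoff(singular_values, eps):
    """Assumed definition: number of singular values with sigma_k / sigma_1 >= eps.
    Expects eps > 0 and at least one nonzero singular value."""
    s = np.sort(np.asarray(singular_values, dtype=float))[::-1]
    return int(np.sum(s >= eps * s[0]))

def truncated_entropy(singular_values, eps):
    """Assumed definition: normalized Shannon entropy of the retained spectrum."""
    s = np.sort(np.asarray(singular_values, dtype=float))[::-1]
    k = effective_cutoff(s, eps)
    if k <= 1:
        return 0.0                       # a single retained value has no spread
    p = s[:k] / s[:k].sum()              # retained values as a probability vector
    return float(-(p * np.log(p)).sum() / np.log(k))

flat = np.ones(4)                        # evenly spread spectrum
decaying = np.logspace(0, -10, 20)       # rapidly decaying spectrum
print("flat spectrum:    ", truncated_entropy(flat, 1e-12))
print("decaying spectrum:", truncated_entropy(decaying, 1e-12))
```

Under this assumed form, a spectrum spread evenly over the retained values gives a value near 1, while a spectrum dominated by a few large values gives a value closer to 0.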