Automatic Differentiation is Essential in Training Neural Networks for Solving Differential Equations
Chuqi Chen, Yahong Yang, Yang Xiang, Wenrui Hao
TL;DR
This work compares automatic differentiation (AD) with finite-difference (FD) approximation in neural PDE solvers, introducing a truncated-entropy metric $H_A(a)$ to quantify how the singular-value spectrum influences training. Through theoretical analysis and extensive experiments on random-feature models and two-layer neural networks solving Poisson and biharmonic equations, the authors show that AD and FD share the same large singular values, but that FD retains more small singular values, which slows training. The truncated-entropy framework links spectral properties to training speed and predicts that AD yields faster convergence and smaller training residuals, a result borne out particularly in higher-order PDEs and deeper networks. The findings offer a principled lens for diagnosing, and potentially improving, the training dynamics of physics-informed neural networks and related PDE solvers in science and engineering.
Abstract
Neural network-based approaches have recently shown significant promise in solving partial differential equations (PDEs) in science and engineering, especially in scenarios featuring complex domains or the incorporation of empirical data. One advantage of neural network methods for PDEs lies in their use of automatic differentiation (AD), which requires only the sample points themselves, whereas traditional finite difference (FD) approximations need nearby local points to compute derivatives. In this paper, we quantitatively demonstrate the advantage of AD in training neural networks. The concept of truncated entropy is introduced to characterize the training behavior. Specifically, through comprehensive experimental and theoretical analyses conducted on random feature models and two-layer neural networks, we find that the truncated entropy serves as a reliable metric for quantifying the residual loss of random feature models and the training speed of neural networks for both AD and FD methods. Our experimental and theoretical analyses demonstrate that, from a training perspective, AD outperforms FD in solving PDEs.
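
The contrast drawn in the abstract, that AD needs only the sample point while FD needs neighboring points, can be illustrated with a small sketch. It is not the authors' implementation. It assumes a single hypothetical tanh feature `feature(x) = tanh(w*x + b)` with illustrative values `w = 2.0`, `b = 0.5`, a hand-written forward-mode dual-number class `Dual` standing in for a full AD library, and a central difference with an illustrative step `h = 1e-3` as the FD scheme.

```python
import math

class Dual:
    """Value plus first-derivative part, for forward-mode AD (illustrative helper)."""
    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def tanh(x):
    """tanh that propagates derivatives when given a Dual."""
    if isinstance(x, Dual):
        t = math.tanh(x.val)
        return Dual(t, (1.0 - t * t) * x.dot)  # chain rule
    return math.tanh(x)

def feature(x, w=2.0, b=0.5):
    """Hypothetical single random feature; w and b are assumed values."""
    return tanh(w * x + b)

def ad_derivative(f, x):
    """AD: evaluates f once, at the sample point x only."""
    return f(Dual(x, 1.0)).dot

def fd_derivative(f, x, h=1e-3):
    """FD: central difference, evaluates f at the neighbors x - h and x + h."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

x0 = 0.3
exact = 2.0 * (1.0 - math.tanh(2.0 * x0 + 0.5) ** 2)  # closed-form derivative
ad_err = abs(ad_derivative(feature, x0) - exact)
fd_err = abs(fd_derivative(feature, x0) - exact)
print(f"AD error: {ad_err:.2e}, FD error: {fd_err:.2e}")
```

In this sketch the AD derivative agrees with the closed form up to floating-point rounding, while the FD derivative carries a truncation error that scales with `h**2`. The paper's analysis concerns how such differences affect the singular-value spectrum of the resulting training system, which this single-feature example does not attempt to reproduce.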
