A Quasilinear Algorithm for Computing Higher-Order Derivatives of Deep Feed-Forward Neural Networks

Kyle R. Chickering

TL;DR

High-order derivatives in PINNs are expensive under standard autodifferentiation, with runtime scaling as $O\left(\frac{e^{\sqrt{n}}}{n}M^n\right)$ and memory $O(M^n)$. The authors introduce $n$-TangentProp, an exact quasilinear extension of TangentProp that uses Faà di Bruno's formula to compute $d^n/dx^n f(x)$ in $O\left(e^{\sqrt{n}} M\right)$ time and $O(nM)$ memory in a single forward pass. Empirically, they validate scaling across depths and widths, demonstrate substantial end-to-end PINN training speedups on Burgers self-similar profiles, and show that higher-order derivatives (up to nine) become computationally feasible where autodiff fails. The work suggests that adopting $n$-TangentProp can make PINNs more competitive for forward/inverse problems requiring many derivatives and complex Sobolev losses.
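For reference, Faà di Bruno's formula can be written in terms of partial Bell polynomials $B_{n,k}$. The symbols $\sigma$ (a smooth activation) and $g$ (a pre-activation viewed as a function of the input $x$) are notation assumed here for illustration; they do not appear in the summary above.

$$
\frac{d^n}{dx^n}\,\sigma\big(g(x)\big) \;=\; \sum_{k=1}^{n} \sigma^{(k)}\big(g(x)\big)\, B_{n,k}\big(g'(x),\, g''(x),\, \dots,\, g^{(n-k+1)}(x)\big)
$$

For $n=2$ and $n=3$ this reduces to the familiar expressions

$$
\frac{d^2}{dx^2}\,\sigma(g) = \sigma''(g)\,(g')^2 + \sigma'(g)\,g'', \qquad
\frac{d^3}{dx^3}\,\sigma(g) = \sigma'''(g)\,(g')^3 + 3\,\sigma''(g)\,g'\,g'' + \sigma'(g)\,g'''.
$$

Each layer therefore needs only the first $n$ derivatives of its own pre-activations, which is consistent with computing all derivatives in a single forward pass.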

Abstract

The use of neural networks for solving differential equations is practically difficult due to the exponentially increasing runtime of autodifferentiation when computing high-order derivatives. We propose $n$-TangentProp, the natural extension of the TangentProp formalism \cite{simard1991tangent} to arbitrarily many derivatives. $n$-TangentProp computes the exact derivative $d^n/dx^n f(x)$ in quasilinear, instead of exponential, time for a densely connected, feed-forward neural network $f$ with a smooth, parameter-free activation function. We validate our algorithm empirically across a range of depths, widths, and numbers of derivatives. We demonstrate that our method is particularly beneficial in the context of physics-informed neural networks, where $n$-TangentProp allows for significantly faster training times than previous methods and has favorable scaling with respect to both model size and loss-function complexity as measured by the number of required derivatives. The code for this paper can be found at https://github.com/kyrochi/n\_tangentprop.
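The idea described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation (see the linked repository for that). It assumes a scalar input, a scalar output, and a `tanh` activation, and it evaluates Faà di Bruno's formula through the standard recurrence for partial Bell polynomials. All function names (`tanh_derivative_polys`, `partial_bell`, `tanh_derivs`, `ntp_forward`) are hypothetical. Each layer carries the list of derivatives $[z, z', \dots, z^{(n)}]$ forward, so no computation graph is differentiated repeatedly.

```python
import math


def tanh_derivative_polys(n):
    """Coefficient lists P_k with tanh^(k)(z) = P_k(tanh z), for k = 0..n."""
    polys = [[0.0, 1.0]]  # P_0(t) = t
    for _ in range(n):
        p = polys[-1]
        dp = [i * c for i, c in enumerate(p)][1:]  # P_k'(t)
        nxt = [0.0] * (len(dp) + 2)
        for i, c in enumerate(dp):  # P_{k+1}(t) = P_k'(t) * (1 - t^2)
            nxt[i] += c
            nxt[i + 2] -= c
        polys.append(nxt)
    return polys


def polyval(coeffs, t):
    return sum(c * t**i for i, c in enumerate(coeffs))


def partial_bell(n, g):
    """Table B[m][k] of partial Bell polynomials in g[1], g[2], ..., g[n]."""
    B = [[0.0] * (n + 1) for _ in range(n + 1)]
    B[0][0] = 1.0
    for m in range(1, n + 1):
        for k in range(1, m + 1):
            B[m][k] = sum(
                math.comb(m - 1, i - 1) * g[i] * B[m - i][k - 1]
                for i in range(1, m - k + 2)
            )
    return B


def tanh_derivs(g, polys):
    """Map [g, g', ..., g^(n)] to the same list for tanh(g(x)) (Faa di Bruno)."""
    n = len(g) - 1
    t = math.tanh(g[0])
    sig = [polyval(p, t) for p in polys[: n + 1]]  # tanh^(k) evaluated at g
    B = partial_bell(n, g)
    out = [t]
    for m in range(1, n + 1):
        out.append(sum(sig[k] * B[m][k] for k in range(1, m + 1)))
    return out


def ntp_forward(x, layers, n):
    """Return [f(x), f'(x), ..., f^(n)(x)] for n >= 1.

    layers is a list of (W, b) pairs with W given as a list of rows; tanh is
    applied after every layer except the last, which is purely affine.
    """
    polys = tanh_derivative_polys(n)
    # One derivative list per coordinate; the input coordinate is x itself.
    acts = [[x, 1.0] + [0.0] * (n - 1)]
    for idx, (W, b) in enumerate(layers):
        pre = []
        for row, bias in zip(W, b):
            # Affine maps act on each derivative order independently.
            z = [sum(w * a[k] for w, a in zip(row, acts)) for k in range(n + 1)]
            z[0] += bias  # the bias only affects the zeroth derivative
            pre.append(z)
        is_last = idx == len(layers) - 1
        acts = pre if is_last else [tanh_derivs(z, polys) for z in pre]
    return acts[0]


# Usage: f(x) = tanh(2x), derivatives up to order 3 at x = 0.3.
example_layers = [([[2.0]], [0.0]), ([[1.0]], [0.0])]
example_out = ntp_forward(0.3, example_layers, 3)
```

In this sketch the work per layer grows with the number of derivative orders carried forward rather than with repeated differentiation of a growing graph. The sketch makes no attempt to reproduce the complexity bounds stated in the paper.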

Paper Structure

This paper contains 12 sections, 12 equations, 10 figures, 1 algorithm.

Figures (10)

  • Figure 1: Average runtime for a combined forward and backward pass using autodifferentiation (red) and $n$-TangentProp (blue). The top and bottom frames show the same data; the bottom frame uses a logarithmic $y$-axis. Each model is run 100 times and the average runtime is plotted. The network has 3 hidden layers of 24 neurons each, a common PINN architecture. The batch size is $2^{8}=256$ samples. The forward and backward pass times are shown separately in Figures \ref{fig:forward_times} and \ref{fig:backward_times}, respectively.
  • Figure 2: Forward pass times for the model shown in Figure \ref{fig:forward_pass_times_const_params}. The top and bottom frames show the same data; the bottom frame uses a logarithmic $y$-axis. Each model is run 100 times and the average runtime is plotted. The network has 3 hidden layers of 24 neurons each, a common PINN architecture. The batch size is $2^{8}=256$ samples.
  • Figure 3: Backward pass times for the model shown in Figure \ref{fig:forward_pass_times_const_params}. The top and bottom frames show the same data; the bottom frame uses a logarithmic $y$-axis. Each model is run 100 times and the average runtime is plotted. The network has 3 hidden layers of 24 neurons each, a common PINN architecture. The batch size is $2^{8}=256$ samples.
  • Figure 4: The ratio of forward pass runtimes between autodifferentiation and $n$-TangentProp for a variety of network architectures, input batch sizes, and numbers of derivatives. A ratio greater than $1$ indicates that $n$-TangentProp was faster than autodifferentiation. The baseline ratio of $1$ is plotted as a horizontal dashed line. Each plotted data point is the average of $100$ trials.
  • Figure 5: The ratio of combined forward-backward pass runtimes between autodifferentiation and $n$-TangentProp for a variety of network architectures, input batch sizes, and numbers of derivatives. A ratio greater than $1$ indicates that $n$-TangentProp was faster than autodifferentiation. The baseline ratio of $1$ is plotted as a horizontal dashed line. Each plotted data point is the average of $100$ trials. The forward pass time ratio alone is plotted in Figure \ref{fig:ftime_grid}.
  • ...and 5 more figures