Nonlinear discretizations and Newton's method: characterizing stationary points of regression objectives

Conor Rowan

TL;DR

The work analyzes exact Newton methods for regression problems with nonlinear neural discretizations and PINNs, showing that they frequently converge to trivial saddle points rather than minima because of the geometry of the approximation manifold. It argues that the practical success of second-order optimization stems from the curvature safeguards in quasi-Newton and saddle-free variants, which avoid negative curvature and so avoid convergence to saddles. Through neural network regression and PINN experiments, the author shows that Newton's method can drive the network to learn an orthogonal basis or to satisfy a differential-operator orthogonality condition rather than minimize the target error, highlighting the prevalence of saddle points in high dimensions. These insights clarify why curvature-aware second-order methods outperform plain Newton in practice and suggest that saddles, not local minima, dominate large neural network loss landscapes.

Abstract

Second-order methods are emerging as promising alternatives to standard first-order optimizers such as gradient descent and Adam for training neural networks. Though the advantages of including curvature information in computing optimization steps have been celebrated in the scientific machine learning literature, the only second-order methods that have been studied are quasi-Newton, meaning that the Hessian matrix of the objective function is approximated. Though one would expect only to gain from using the true Hessian in place of its approximation, we show that neural network training reliably fails when relying on exact curvature information. The failure modes provide insight into both the geometry of nonlinear discretizations and the distribution of stationary points in the loss landscape, leading us to question the conventional wisdom that the loss landscape is replete with local minima.

Paper Structure

This paper contains 5 sections, 32 equations, 12 figures.

Figures (12)

  • Figure 1: The error vector is orthogonal to the tangent of the unit circle approximation space both when the magnitude of the error vector is minimized and when it is maximized.
  • Figure 2: The nonlinear discretization of a vector in $\mathbb{R}^3$ is defined by two parameters that traverse the surface of an ellipsoidal torus. The approximation space is visualized in 3D (left) and in cross-section (right).
  • Figure 3: There are multiple stationary points of the objective function for the regression problem. Using the Hessian matrices, we classify Solution 1 as a minimum, Solution 2 as a saddle point, and Solution 3 as a maximum.
  • Figure 4: All stationary points are in the $x_3=0$ plane and lie along one of the coordinate axes (left). We show $25$ convergence histories of Newton's method for random initializations of the parameters (right). Converged solutions are indicated by blue dots. Note that by periodicity, the saddle points at each of the four corners are actually the same solution. This is also the case for the minimum found along the left and right edge of the domain and the saddle at the center of the top and bottom edges.
  • Figure 5: Exact Newton optimization obtains the trivial solution we identified. Note that the magnitude of each basis function is normalized to unity for visualization purposes.
  • ...and 7 more figures
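
The situation described in the Figure 1 caption can be reproduced with a few lines of Python. The sketch below is not taken from the paper: it assumes the approximation space is the unit circle $u(\theta) = (\cos\theta, \sin\theta)$, the objective is $f(\theta) = \tfrac{1}{2}\|u(\theta) - y\|^2$, and the target $y = (2, 0)$ and the starting angles are chosen purely for illustration. Under those assumptions $f'(\theta) = -y \cdot u'(\theta)$ and $f''(\theta) = y \cdot u(\theta)$, so an unsafeguarded Newton step converges to whichever stationary point is nearby, including the one that maximizes the error.

```python
import math

def newton_on_circle(theta0, target, steps=20):
    """Exact (unsafeguarded) Newton iteration for
    f(theta) = 0.5 * ||u(theta) - target||^2 with u(theta) = (cos theta, sin theta).

    Hypothetical helper for illustration; not the paper's implementation.
    """
    y1, y2 = target
    theta = theta0
    for _ in range(steps):
        c, s = math.cos(theta), math.sin(theta)
        grad = y1 * s - y2 * c   # f'(theta)  = -(target . u'(theta)), u' = (-sin, cos)
        hess = y1 * c + y2 * s   # f''(theta) =   target . u(theta)
        theta -= grad / hess     # Newton step, no check on the sign of the curvature
    return theta

def error_norm(theta, target):
    """Magnitude of the error vector u(theta) - target."""
    return math.hypot(math.cos(theta) - target[0], math.sin(theta) - target[1])

target = (2.0, 0.0)                      # assumed target outside the unit circle
for theta0 in (0.5, 2.5):                # assumed initializations
    theta = newton_on_circle(theta0, target)
    print(f"start {theta0}: theta = {theta:.6f}, error = {error_norm(theta, target):.6f}")
```

With these choices, the initialization at 0.5 converges to the minimizer near $\theta = 0$, while the initialization at 2.5 converges to $\theta = \pi$, where the error vector is also orthogonal to the tangent but its magnitude is maximized. A safeguarded method that rejects steps taken with negative curvature would not accept the second result.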