Exact Gauss-Newton Optimization for Training Deep Neural Networks

Mikalai Korbit; Adeyemi D. Adeoye; Alberto Bemporad; Mario Zanon

TL;DR

Exact Gauss-Newton (EGN) is a stochastic second-order optimizer that uses the Duncan-Guttman identity to compute the Levenberg-Marquardt direction exactly from a linear system of size $bc \times bc$ (batch size times output dimension), namely $(\mathbf{Q}\mathbf{J}\mathbf{J}^{\top} + b\lambda\mathbf{I}_{bc})^{-1}\mathbf{r}$. This avoids inverting the full regularized Hessian approximation, whose cost is cubic in the number of parameters. The method relies on the generalized Gauss-Newton approximation $\mathbf{H}^{\mathrm{GN}} = \frac{1}{b}\mathbf{J}^{\top}\mathbf{Q}\mathbf{J}$ and regularizes it with $\lambda\mathbf{I}$, which keeps updates stable even in stochastic settings. A convergence analysis shows that, under standard assumptions, the gradient norm converges to zero in expectation, and empirical results show that EGN often matches or outperforms tuned SGD, Adam, GAF, SQN, and SGN across supervised and reinforcement learning tasks. The work also describes practical enhancements (momentum, line search, and adaptive regularization) that further improve robustness and performance, and notes limitations related to explicit Jacobians, large batch regimes, and multi-class classification with many classes.
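
To see why a $bc \times bc$ solve is enough, here is a short sketch of the algebra. It assumes the mini-batch gradient can be written as $\frac{1}{b}\mathbf{J}^{\top}\mathbf{r}$ with $\mathbf{J} \in \mathbb{R}^{bc \times n}$; this gradient form, the symbol $n$ for the number of parameters, and the symbol $\mathbf{d}$ for the direction are assumptions introduced here for illustration.

$$
\mathbf{d}
= -\left(\tfrac{1}{b}\mathbf{J}^{\top}\mathbf{Q}\mathbf{J} + \lambda\mathbf{I}_{n}\right)^{-1}\tfrac{1}{b}\mathbf{J}^{\top}\mathbf{r}
= -\left(\mathbf{J}^{\top}\mathbf{Q}\mathbf{J} + b\lambda\mathbf{I}_{n}\right)^{-1}\mathbf{J}^{\top}\mathbf{r}
= -\mathbf{J}^{\top}\left(\mathbf{Q}\mathbf{J}\mathbf{J}^{\top} + b\lambda\mathbf{I}_{bc}\right)^{-1}\mathbf{r}.
$$

The last equality follows from $\mathbf{J}^{\top}\left(\mathbf{Q}\mathbf{J}\mathbf{J}^{\top} + b\lambda\mathbf{I}_{bc}\right) = \left(\mathbf{J}^{\top}\mathbf{Q}\mathbf{J} + b\lambda\mathbf{I}_{n}\right)\mathbf{J}^{\top}$, multiplying by the two inverses on either side. Only the $bc \times bc$ matrix has to be factorized; the remaining work is matrix-vector products with $\mathbf{J}$.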

Abstract

We present Exact Gauss-Newton (EGN), a stochastic second-order optimization algorithm that combines the generalized Gauss-Newton (GN) Hessian approximation with low-rank linear algebra to compute the descent direction. Leveraging the Duncan-Guttman matrix identity, the parameter update is obtained by factorizing a matrix whose size is that of the mini-batch. This is particularly advantageous for large-scale machine learning problems where the dimension of the neural network parameter vector is several orders of magnitude larger than the batch size. Additionally, we show how improvements such as line search, adaptive regularization, and momentum can be added to EGN to further accelerate the algorithm. Moreover, under mild assumptions, we prove that our algorithm converges in expectation to a stationary point of the objective. Finally, our numerical experiments demonstrate that EGN consistently matches or exceeds the generalization performance of well-tuned SGD, Adam, GAF, SQN, and SGN optimizers across various supervised and reinforcement learning tasks.
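
The batch-sized solve described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `egn_direction` and `naive_direction` are hypothetical, `r` is assumed to be the stacked gradient of the loss with respect to the model outputs, and `np.linalg.solve` stands in for whatever factorization the paper uses. The sketch compares the small $bc \times bc$ solve against the full parameter-sized solve on random data.

```python
import numpy as np


def egn_direction(J, r, Q, lam, b):
    """Damped Gauss-Newton direction from a (b*c) x (b*c) linear solve.

    J   : (b*c, n) stacked Jacobian of model outputs w.r.t. parameters
    r   : (b*c,)   stacked loss gradient w.r.t. model outputs (assumed meaning)
    Q   : (b*c, b*c) stacked loss Hessian w.r.t. model outputs
    lam : damping factor lambda > 0
    b   : batch size
    """
    bc = J.shape[0]
    small = Q @ J @ J.T + b * lam * np.eye(bc)  # only a (bc, bc) matrix
    return -J.T @ np.linalg.solve(small, r)


def naive_direction(J, r, Q, lam, b):
    """Same direction from the full n x n system, for comparison only."""
    n = J.shape[1]
    H = J.T @ Q @ J / b + lam * np.eye(n)  # regularized GN Hessian, (n, n)
    g = J.T @ r / b                        # mini-batch gradient
    return -np.linalg.solve(H, g)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    b, c, n = 8, 1, 500                    # batch, outputs, parameters
    J = rng.standard_normal((b * c, n))
    r = rng.standard_normal(b * c)
    Q = np.eye(b * c)                      # identity Q, e.g. squared-error loss
    d_small = egn_direction(J, r, Q, lam=0.1, b=b)
    d_full = naive_direction(J, r, Q, lam=0.1, b=b)
    print("max abs difference:", np.max(np.abs(d_small - d_full)))
```

With `n` much larger than `b * c`, the first function factorizes an 8 x 8 matrix where the second factorizes a 500 x 500 one, which is the regime the abstract targets.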

Paper Structure (43 sections, 4 theorems, 73 equations, 3 figures, 8 tables, 7 algorithms)

Introduction
Related Work
Preliminaries
Gradient-based Optimization
Generalized Gauss-Newton Hessian Approximation
Regression
Multi-class Classification
Algorithm
Comparison to Existing Methods
Additional Improvements
Momentum
Line Search
Adaptive Regularization
Convergence Analysis
Experiments
...and 28 more sections

Key Result

Theorem 4.1

Assuming $\mathbf{A}$ and $\mathbf{D}$ are full-rank matrices, the following identity holds
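
A common way to write the Duncan-Guttman identity is given below. The block names $\mathbf{B}$ and $\mathbf{C}$ and the exact layout are assumptions and may differ from the paper's notation; the form also requires both bracketed matrices to be invertible.

$$
\left(\mathbf{A} - \mathbf{B}^{\top}\mathbf{D}^{-1}\mathbf{C}\right)^{-1}\mathbf{B}^{\top}\mathbf{D}^{-1}
= \mathbf{A}^{-1}\mathbf{B}^{\top}\left(\mathbf{D} - \mathbf{C}\mathbf{A}^{-1}\mathbf{B}^{\top}\right)^{-1}
$$

As one possible specialization, taking $\mathbf{A} = b\lambda\mathbf{I}$, $\mathbf{B} = \mathbf{J}$, $\mathbf{C} = -\mathbf{Q}\mathbf{J}$, and $\mathbf{D} = \mathbf{I}_{bc}$ gives $\left(\mathbf{J}^{\top}\mathbf{Q}\mathbf{J} + b\lambda\mathbf{I}\right)^{-1}\mathbf{J}^{\top} = \mathbf{J}^{\top}\left(\mathbf{Q}\mathbf{J}\mathbf{J}^{\top} + b\lambda\mathbf{I}_{bc}\right)^{-1}$, which moves the inverse from parameter space to the $bc$-dimensional space.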

Figures (3)

Figure 1: Learning curves on the test set for SGD, Adam, GAF, SQN, SGN and EGN. The shaded area represents $\pm1$ standard deviation around the mean (thick line) for $10$ seeds.
Figure 2: Learning curves for SGD, Adam, SGN and EGN. The shaded area represents $\pm 1$ standard deviation around the mean return (thick line) for $10$ seeds.
Figure 3: Proportion of total update time spent in Algorithm \ref{alg:slm_direction_finder} as a function of batch size for fully-connected neural networks (MLPs) of various sizes. Curves show the mean over 1000 runs; shaded regions denote $\pm 1$ standard deviation. Absolute wall-clock times are reported in Appendix \ref{apx:limitations}.

Theorems & Definitions (12)

Theorem 4.1 (Duncan, 1944; Guttman, 1946)
Lemma 4.2
proof
Lemma 5.5
proof
Theorem 5.6
proof
proof
proof
proof
...and 2 more
