Exact Gauss-Newton Optimization for Training Deep Neural Networks

Mikalai Korbit, Adeyemi D. Adeoye, Alberto Bemporad, Mario Zanon

TL;DR

Exact Gauss-Newton (EGN) is a stochastic second-order optimizer that uses the Duncan-Guttman identity to compute the Levenberg-Marquardt direction exactly in a low-dimensional space of size $bc$, where $b$ is the batch size and $c$ is the network output dimension. The only linear system it solves involves the $bc \times bc$ matrix $\mathbf{Q}\mathbf{J}\mathbf{J}^{\top} + b\lambda\mathbf{I}_{bc}$, through the product $(\mathbf{Q}\mathbf{J}\mathbf{J}^{\top} + b\lambda\mathbf{I}_{bc})^{-1}\mathbf{r}$, which avoids the cubic cost of inverting the full parameter-space Hessian approximation. The method relies on the generalized Gauss-Newton approximation $\mathbf{H}^{\mathrm{GN}} = \frac{1}{b}\mathbf{J}^{\top}\mathbf{Q}\mathbf{J}$ and regularizes it with $\lambda\mathbf{I}$, which keeps updates stable even in stochastic settings. A convergence analysis shows that, under standard assumptions, the gradient norm converges to zero in expectation, and empirical results show that EGN often outperforms or matches tuned SGD, Adam, GAF, SQN, and SGN across supervised and reinforcement learning tasks. The work also describes practical enhancements (momentum, line search, and adaptive regularization) that further improve robustness and performance, and it notes limitations related to explicit Jacobians, large-batch regimes, and multi-class classification with many classes.

Abstract

We present Exact Gauss-Newton (EGN), a stochastic second-order optimization algorithm that combines the generalized Gauss-Newton (GN) Hessian approximation with low-rank linear algebra to compute the descent direction. Leveraging the Duncan-Guttman matrix identity, the parameter update is obtained by factorizing a matrix which has the size of the mini-batch. This is particularly advantageous for large-scale machine learning problems where the dimension of the neural network parameter vector is several orders of magnitude larger than the batch size. Additionally, we show how improvements such as line search, adaptive regularization, and momentum can be seamlessly added to EGN to further accelerate the algorithm. Moreover, under mild assumptions, we prove that our algorithm converges in expectation to a stationary point of the objective. Finally, our numerical experiments demonstrate that EGN consistently matches or exceeds the generalization performance of well-tuned SGD, Adam, GAF, SQN, and SGN optimizers across various supervised and reinforcement learning tasks.
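The claim that the update only needs a mini-batch-sized factorization can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation. It assumes the mini-batch gradient is $\frac{1}{b}\mathbf{J}^{\top}\mathbf{r}$ and that the direction is the negative of the regularized solve. The function names, the random test data, and the diagonal choice of $\mathbf{Q}$ are all illustrative assumptions.

```python
import numpy as np


def lm_direction_full(J, Q, r, lam, b):
    """Levenberg-Marquardt direction from the d x d system (reference only)."""
    d = J.shape[1]
    H = J.T @ Q @ J / b + lam * np.eye(d)  # regularized Gauss-Newton matrix
    g = J.T @ r / b                        # assumed form of the mini-batch gradient
    return -np.linalg.solve(H, g)


def lm_direction_small(J, Q, r, lam, b):
    """Same direction, but the only linear solve is (b*c) x (b*c)."""
    m = J.shape[0]                         # m = b * c
    K = Q @ J @ J.T + b * lam * np.eye(m)
    return -J.T @ np.linalg.solve(K, r)


# Hypothetical problem sizes: batch b, outputs c, parameters d (d >> b * c).
rng = np.random.default_rng(0)
b, c, d = 4, 3, 200
J = rng.standard_normal((b * c, d))             # stacked Jacobian of the outputs
Q = np.diag(rng.uniform(0.5, 2.0, size=b * c))  # illustrative positive-definite Q
r = rng.standard_normal(b * c)                  # stacked residual vector
lam = 0.1

d_full = lm_direction_full(J, Q, r, lam, b)
d_small = lm_direction_small(J, Q, r, lam, b)
max_gap = np.max(np.abs(d_full - d_small))      # difference between the two routes
```

Both routes return a vector of length `d`, but the second one never forms a `d x d` matrix, which is the source of the savings when the parameter count is much larger than `b * c`.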

Paper Structure

This paper contains 43 sections, 4 theorems, 73 equations, 3 figures, 8 tables, 7 algorithms.

Key Result

Theorem 4.1

Assuming $\mathbf{A}$ and $\mathbf{D}$ are full-rank matrices, the following identity holds
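
The statement is cut off here before the identity itself. For reference, a commonly stated form of the Duncan-Guttman identity is given below, for conformable matrices $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, $\mathbf{D}$ and assuming the indicated inverses exist. The paper's exact notation may differ.

$$
(\mathbf{A} - \mathbf{B}^{\top}\mathbf{D}^{-1}\mathbf{C})^{-1}\mathbf{B}^{\top}\mathbf{D}^{-1} = \mathbf{A}^{-1}\mathbf{B}^{\top}(\mathbf{D} - \mathbf{C}\mathbf{A}^{-1}\mathbf{B}^{\top})^{-1}
$$

As a sketch of how this connects to the $bc$-dimensional solve in the TL;DR, write $d$ for the number of parameters (a symbol introduced here for illustration). The relevant special case is then

$$
(\mathbf{J}^{\top}\mathbf{Q}\mathbf{J} + b\lambda\mathbf{I}_{d})^{-1}\mathbf{J}^{\top} = \mathbf{J}^{\top}(\mathbf{Q}\mathbf{J}\mathbf{J}^{\top} + b\lambda\mathbf{I}_{bc})^{-1}
$$

This can be verified directly: left-multiplying both sides by $\mathbf{J}^{\top}\mathbf{Q}\mathbf{J} + b\lambda\mathbf{I}_{d}$ and right-multiplying by $\mathbf{Q}\mathbf{J}\mathbf{J}^{\top} + b\lambda\mathbf{I}_{bc}$ gives $\mathbf{J}^{\top}\mathbf{Q}\mathbf{J}\mathbf{J}^{\top} + b\lambda\mathbf{J}^{\top}$ on each side.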

Figures (3)

  • Figure 1: Learning curves on the test set for SGD, Adam, GAF, SQN, SGN and EGN. The shaded area represents $\pm1$ standard deviation around the mean (thick line) for $10$ seeds.
  • Figure 2: Learning curves for SGD, Adam, SGN and EGN. The shaded area represents $\pm 1$ standard deviation around the mean return (thick line) for $10$ seeds.
  • Figure 3: Proportion of total update time spent in Algorithm \ref{alg:slm_direction_finder} as a function of batch size for fully-connected neural networks (MLPs) of various sizes. Curves show the mean over 1000 runs; shaded regions denote $\pm 1$ standard deviation. Absolute wall-clock times are reported in Appendix \ref{apx:limitations}.

Theorems & Definitions (12)

  • Theorem 4.1 (Duncan, 1944; Guttman, 1946)
  • Lemma 4.2
  • proof
  • Lemma 5.5
  • proof
  • Theorem 5.6
  • proof
  • proof
  • proof
  • proof
  • ...and 2 more