Exact Gauss-Newton Optimization for Training Deep Neural Networks
Mikalai Korbit, Adeyemi D. Adeoye, Alberto Bemporad, Mario Zanon
TL;DR
Exact Gauss-Newton (EGN) is a stochastic second-order optimizer that uses the Duncan-Guttman identity to compute the Levenberg-Marquardt direction exactly in a low-dimensional space of size $bc$ (batch size $b$ times output dimension $c$). Instead of inverting the full Hessian approximation, whose cost is cubic in the number of parameters, it solves a linear system with the $bc \times bc$ matrix $\mathbf{Q}\mathbf{J}\mathbf{J}^{\top} + b\lambda\mathbf{I}_{bc}$ and right-hand side $\mathbf{r}$, that is, it evaluates $(\mathbf{Q}\mathbf{J}\mathbf{J}^{\top} + b\lambda\mathbf{I}_{bc})^{-1}\mathbf{r}$. The method relies on the generalized Gauss-Newton approximation $\mathbf{H}^{\mathrm{GN}} = \frac{1}{b}\mathbf{J}^{\top}\mathbf{Q}\mathbf{J}$ and regularizes it with $\lambda\mathbf{I}$, which keeps updates stable even in stochastic settings. A convergence analysis shows that, under standard assumptions, the gradient norm converges to zero in expectation, and empirical results demonstrate that EGN often outperforms or matches tuned SGD, Adam, GAF, SQN, and SGN across supervised and reinforcement learning tasks. The work highlights practical enhancements (momentum, line search, and adaptive regularization) that further boost robustness and performance, and notes limitations related to explicit Jacobians, large batch regimes, and multi-class classification with many classes.
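The step from the full-size system to the $bc \times bc$ system can be written out as a short derivation. This is a sketch under two assumptions not stated in the summary above: the mini-batch gradient has the form $\mathbf{g} = \frac{1}{b}\mathbf{J}^{\top}\mathbf{r}$, and $n$ denotes the number of parameters.

```latex
\begin{aligned}
\mathbf{d}^{\mathrm{LM}}
&= -\left(\mathbf{H}^{\mathrm{GN}} + \lambda\mathbf{I}_{n}\right)^{-1}\mathbf{g} \\
&= -\left(\tfrac{1}{b}\mathbf{J}^{\top}\mathbf{Q}\mathbf{J} + \lambda\mathbf{I}_{n}\right)^{-1}\tfrac{1}{b}\mathbf{J}^{\top}\mathbf{r} \\
&= -\left(\mathbf{J}^{\top}\mathbf{Q}\mathbf{J} + b\lambda\mathbf{I}_{n}\right)^{-1}\mathbf{J}^{\top}\mathbf{r} \\
&= -\mathbf{J}^{\top}\left(\mathbf{Q}\mathbf{J}\mathbf{J}^{\top} + b\lambda\mathbf{I}_{bc}\right)^{-1}\mathbf{r}
\end{aligned}
```

The last line follows from $\left(\mathbf{J}^{\top}\mathbf{Q}\mathbf{J} + b\lambda\mathbf{I}_{n}\right)\mathbf{J}^{\top} = \mathbf{J}^{\top}\left(\mathbf{Q}\mathbf{J}\mathbf{J}^{\top} + b\lambda\mathbf{I}_{bc}\right)$, multiplied by the respective inverses on the left and on the right. Both inverses exist when $\lambda > 0$ and $\mathbf{Q}$ is positive semidefinite.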
Abstract
We present Exact Gauss-Newton (EGN), a stochastic second-order optimization algorithm that combines the generalized Gauss-Newton (GN) Hessian approximation with low-rank linear algebra to compute the descent direction. Leveraging the Duncan-Guttman matrix identity, the parameter update is obtained by factorizing a matrix whose size scales with the mini-batch size rather than with the number of parameters. This is particularly advantageous for large-scale machine learning problems where the dimension of the neural network parameter vector is several orders of magnitude larger than the batch size. Additionally, we show how improvements such as line search, adaptive regularization, and momentum can be seamlessly added to EGN to further accelerate the algorithm. Moreover, under mild assumptions, we prove that our algorithm converges in expectation to a stationary point of the objective. Finally, our numerical experiments demonstrate that EGN consistently matches or exceeds the generalization performance of well-tuned SGD, Adam, GAF, SQN, and SGN optimizers across various supervised and reinforcement learning tasks.
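To make the low-dimensional solve concrete, here is a minimal NumPy sketch that compares a direct solve of the regularized Gauss-Newton system in parameter space with the equivalent solve in a space of size $bc$. It is an illustration, not the authors' implementation: the function names, the random test data, the dense $\mathbf{Q}$, and the gradient form $\frac{1}{b}\mathbf{J}^{\top}\mathbf{r}$ are all assumptions.

```python
import numpy as np


def lm_direction_full(J, Q, r, lam, b):
    """Reference: solve the n x n system (H_GN + lam * I) d = -g directly.

    Assumes the mini-batch gradient is g = J^T r / b (illustrative assumption).
    """
    n = J.shape[1]
    H = J.T @ Q @ J / b + lam * np.eye(n)  # regularized Gauss-Newton matrix
    g = J.T @ r / b
    return -np.linalg.solve(H, g)


def lm_direction_lowdim(J, Q, r, lam, b):
    """Same direction obtained from a (bc) x (bc) linear solve."""
    m = J.shape[0]  # m = b * c
    A = Q @ J @ J.T + b * lam * np.eye(m)
    return -J.T @ np.linalg.solve(A, r)


# Hypothetical problem sizes: batch 8, output dimension 2, 500 parameters.
rng = np.random.default_rng(0)
b, c, n = 8, 2, 500
J = rng.standard_normal((b * c, n))    # stacked Jacobian, shape (bc, n)
M = rng.standard_normal((b * c, b * c))
Q = M @ M.T + np.eye(b * c)            # symmetric positive definite stand-in
r = rng.standard_normal(b * c)         # stacked residual vector
lam = 0.1

d_full = lm_direction_full(J, Q, r, lam, b)
d_low = lm_direction_lowdim(J, Q, r, lam, b)
max_abs_diff = float(np.max(np.abs(d_full - d_low)))
```

In this sketch the two directions agree up to floating-point error, while the second function only ever solves a 16 x 16 system instead of a 500 x 500 one. The gap widens as the parameter count grows relative to $bc$, which is the regime the abstract targets.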
