A Riemannian Optimization Perspective of the Gauss-Newton Method for Feedforward Neural Networks
Semih Cayci
TL;DR
This work provides non-asymptotic convergence guarantees for the Gauss-Newton method in neural network training by taking a Riemannian optimization view. It shows that Gauss-Newton preconditioning can substantially accelerate convergence in ill-conditioned settings. In the overparameterized regime, the rates are independent of the conditioning of the kernel (Gram) matrix; in the underparameterized regime, a geodesic Polyak-Łojasiewicz analysis yields last-iterate convergence. Adaptive damping and data-dependent output scaling are the key mechanisms that unify the over- and underparameterized analyses, enabling fast convergence without favorable conditioning or global strong convexity. The results extend to deep architectures and offer insight into implicit bias, curvature control, and regime-specific optimization dynamics, with practical implications for designing preconditioned optimizers in deep learning.
Abstract
In this work, we establish non-asymptotic convergence bounds for the Gauss-Newton method in training neural networks with smooth activations. In the underparameterized regime, the Gauss-Newton gradient flow in parameter space induces a Riemannian gradient flow on a low-dimensional embedded submanifold of the function space. Using tools from Riemannian optimization, we establish geodesic Polyak-Łojasiewicz and Lipschitz-smoothness conditions for the loss under appropriately chosen output scaling, yielding geometric convergence to the optimal in-class predictor at an explicit rate independent of the conditioning of the Gram matrix. In the overparameterized regime, we propose adaptive, curvature-aware regularization schedules that ensure fast geometric convergence to a global optimum at a rate independent of the minimum eigenvalue of the neural tangent kernel and, locally, of the modulus of strong convexity of the loss. These results demonstrate that Gauss-Newton achieves accelerated convergence rates in settings where first-order methods exhibit slow convergence due to ill-conditioned kernel matrices and loss landscapes.
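To make the damped Gauss-Newton update concrete, the following is a minimal sketch and not the method analyzed in this work. It assumes a toy one-hidden-layer tanh network with fixed output weights, a squared loss, and four training points, so the model is overparameterized (16 parameters, 4 samples). The update is w ← w − Jᵀ(JJᵀ + λI)⁻¹(f(w) − y), where JJᵀ is the empirical tangent-kernel Gram matrix. The accept/reject rule for λ is the classical Levenberg-Marquardt heuristic, used here only as a stand-in for an adaptive regularization schedule. All names (`solve`, `predict`, `jacobian`, `gauss_newton_step`, `train`) and all constants are hypothetical.

```python
import math
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def predict(w, c, xs):
    """Toy model (assumption): f(x) = sum_i c_i * tanh(w_i * x) / sqrt(m)."""
    m = len(w)
    return [sum(c[i] * math.tanh(w[i] * x) for i in range(m)) / math.sqrt(m)
            for x in xs]


def jacobian(w, c, xs):
    """Rows index samples, columns index parameters: J[a][i] = df(x_a)/dw_i."""
    m = len(w)
    return [[c[i] * x * (1.0 - math.tanh(w[i] * x) ** 2) / math.sqrt(m)
             for i in range(m)] for x in xs]


def loss(w, c, xs, ys):
    return 0.5 * sum((f - y) ** 2 for f, y in zip(predict(w, c, xs), ys))


def gauss_newton_step(w, c, xs, ys, lam):
    """Damped Gauss-Newton step: w - J^T (J J^T + lam * I)^{-1} (f - y)."""
    J = jacobian(w, c, xs)
    r = [f - y for f, y in zip(predict(w, c, xs), ys)]
    n, m = len(xs), len(w)
    K = [[sum(J[a][i] * J[b][i] for i in range(m)) + (lam if a == b else 0.0)
          for b in range(n)] for a in range(n)]
    alpha = solve(K, r)  # n x n solve in function space, cheap when n << m
    return [w[i] - sum(J[a][i] * alpha[a] for a in range(n)) for i in range(m)]


def train(steps=30, m=16, seed=0):
    rng = random.Random(seed)
    xs = [-1.0, -0.3, 0.4, 1.0]
    ys = [math.sin(2.0 * x) for x in xs]
    c = [1.0 if i % 2 == 0 else -1.0 for i in range(m)]  # fixed output signs
    w = [rng.gauss(0.0, 1.0) for _ in range(m)]
    lam = 1e-2
    losses = [loss(w, c, xs, ys)]
    for _ in range(steps):
        w_new = gauss_newton_step(w, c, xs, ys, lam)
        new_loss = loss(w_new, c, xs, ys)
        if new_loss < losses[-1]:
            # Accept the step and relax the damping (floored to stay positive).
            w, lam = w_new, max(lam / 10.0, 1e-6)
            losses.append(new_loss)
        else:
            # Reject the step and increase the damping.
            lam *= 10.0
            losses.append(losses[-1])
    return losses


if __name__ == "__main__":
    history = train()
    print("initial loss:", history[0], "final loss:", history[-1])
```

Because a step is accepted only when it lowers the loss, the recorded loss sequence is non-increasing by construction. The sketch solves an n × n system in function space rather than an m × m system in parameter space, which is the natural form when the number of parameters exceeds the number of samples.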
