A Riemannian Optimization Perspective of the Gauss-Newton Method for Feedforward Neural Networks

Semih Cayci

TL;DR

This work provides non-asymptotic convergence guarantees for the Gauss-Newton method in training neural networks with smooth activations, using a Riemannian optimization view. In the underparameterized regime, a geodesic Polyak-Lojasiewicz and Lipschitz-smoothness analysis under appropriately chosen output scaling yields geometric last-iterate convergence to the optimal in-class predictor, at a rate independent of the conditioning of the Gram matrix. In the overparameterized regime, adaptive, curvature-aware damping yields geometric convergence to a global optimum, at a rate independent of the minimum eigenvalue of the neural tangent kernel and, locally, of the modulus of strong convexity of the loss. Gauss-Newton preconditioning thus accelerates convergence in ill-conditioned settings where first-order methods are slow. The results extend to deep architectures and offer insight into implicit bias, curvature control, and regime-specific optimization dynamics, with practical implications for designing preconditioned optimizers in deep learning.

Abstract

In this work, we establish non-asymptotic convergence bounds for the Gauss-Newton method in training neural networks with smooth activations. In the underparameterized regime, the Gauss-Newton gradient flow in parameter space induces a Riemannian gradient flow on a low-dimensional embedded submanifold of the function space. Using tools from Riemannian optimization, we establish geodesic Polyak-Lojasiewicz and Lipschitz-smoothness conditions for the loss under appropriately chosen output scaling, yielding geometric convergence to the optimal in-class predictor at an explicit rate independent of the conditioning of the Gram matrix. In the overparameterized regime, we propose adaptive, curvature-aware regularization schedules that ensure fast geometric convergence to a global optimum at a rate independent of the minimum eigenvalue of the neural tangent kernel and, locally, of the modulus of strong convexity of the loss. These results demonstrate that Gauss-Newton achieves accelerated convergence rates in settings where first-order methods exhibit slow convergence due to ill-conditioned kernel matrices and loss landscapes.
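
To make the damped Gauss-Newton update concrete, the following is a minimal discrete-time sketch in NumPy. It is not the paper's implementation. The preconditioner form $(1-\rho)\bm{J}^\top\bm{J} + \rho\bm{I}$ is an assumption, chosen only because it reduces to plain gradient descent at $\rho=1$ and stays invertible for $\rho\in(0,1]$. The one-hidden-layer tanh network, the squared loss, the step size `lr`, and all function names are likewise hypothetical choices made for illustration.

```python
import numpy as np


def predict(theta, X, m):
    """One-hidden-layer tanh network; theta packs W (m x d) followed by c (m)."""
    d = X.shape[1]
    W = theta[: m * d].reshape(m, d)
    c = theta[m * d:]
    return np.tanh(X @ W.T) @ c / np.sqrt(m)


def jacobian(theta, X, m):
    """Jacobian of the n predictions with respect to theta, shape (n, p)."""
    n, d = X.shape
    W = theta[: m * d].reshape(m, d)
    c = theta[m * d:]
    H = np.tanh(X @ W.T)                      # hidden activations, (n, m)
    dH = (1.0 - H**2) * c                     # c_k * tanh'(w_k . x_i), (n, m)
    J_W = (dH[:, :, None] * X[:, None, :]).reshape(n, m * d)
    return np.hstack([J_W, H]) / np.sqrt(m)   # columns ordered as in theta


def damped_gn_step(theta, X, y, m, rho, lr):
    """One damped Gauss-Newton step on the loss 0.5 * ||f(theta) - y||^2.

    Assumed preconditioner: (1 - rho) * J^T J + rho * I, with rho in (0, 1].
    rho = 1 gives a plain gradient descent step.
    """
    J = jacobian(theta, X, m)
    grad = J.T @ (predict(theta, X, m) - y)
    precond = (1.0 - rho) * (J.T @ J) + rho * np.eye(theta.size)
    return theta - lr * np.linalg.solve(precond, grad)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y, m = rng.normal(size=(5, 2)), rng.normal(size=5), 3
    theta = rng.normal(size=m * X.shape[1] + m)
    for _ in range(20):
        theta = damped_gn_step(theta, X, y, m, rho=0.1, lr=0.5)
    print(0.5 * np.sum((predict(theta, X, m) - y) ** 2))
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit inverse. A time-varying schedule can be emulated by passing a different `rho` at each step.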

Paper Structure

This paper contains 24 sections, 21 theorems, 196 equations, 3 figures.

Key Result

Lemma 2

Under the Gauss-Newton gradient flow with any $(\rho_t)_{t\in[0,\infty)}$ such that $\rho_t\in(0,1]$. Then, for any $t < T$.

Figures (3)

  • Figure 1: Trajectories of the Gauss-Newton flow and the gradient flow in the function space and the parameter space for $n=3$ and $p=2$. Gauss-Newton induces a Riemannian gradient flow on $\mathcal{M}$.
  • Figure 2: Empirical risk in the overparameterized regime under the Gauss-Newton dynamics with various regularization schemes $\rho=(\rho_t)_{t\geq 0}$. Gradient flow $(\rho_t=1)$ suffers from slow convergence due to the ill-conditioned neural tangent kernel, while Gauss-Newton with an appropriate constant or adaptive damping schedule achieves fast exponential convergence.
  • Figure 3: Empirical loss in the underparameterized regime under the Gauss-Newton and gradient flow dynamics for various $\alpha$.

Theorems & Definitions (29)

  • Remark 1: Conditioning of the neural tangent kernel matrix $\bm{K}_0$
  • Lemma 2
  • Lemma 3
  • Lemma 4: Kernel non-degeneracy
  • Theorem 5: Convergence in the overparameterized regime
  • Remark 6: On the benefits of preconditioning
  • Theorem 7: Convergence in the overparameterized regime -- discrete time
  • Remark 8: Hybrid first- and second-order optimizers via adaptive $\rho_t$
  • Corollary 9: Convergence under adaptive damping
  • Theorem 10
  • ...and 19 more