Non-Asymptotic Optimization and Generalization Bounds for Stochastic Gauss-Newton in Overparameterized Models

Semih Cayci

TL;DR

The paper investigates stochastic Gauss-Newton (SGN) with Levenberg–Marquardt damping for training overparameterized deep nets in regression, deriving non-asymptotic convergence and generalization bounds. Through a variable-metric Lyapunov analysis, it provides a finite-time convergence rate $\mathcal{O}\left(\frac{1}{k}\left[\bar{r}_k\log k+\lambda+\lambda^{-1}\right]+\frac{1}{\sqrt{m}}\right)$ and algorithm-dependent generalization bounds via uniform stability, accounting for curvature, batch size, and overparameterization; the damping factor $\lambda$ mediates a trade-off between optimization and generalization. In the neural-tangent-kernel (NTK) regime, SGN achieves global near-optimality on the training data within a rich NTK function class, with bounds scaling as $\mathcal{O}\left(\frac{\bar{r}_k\log k}{k}+\frac{1}{\sqrt{m}}\right)$; overparameterization improves these bounds. The results demonstrate robustness to ill-conditioning (no need for strict NTK conditioning) and show that cumulative, well-conditioned preconditioning can enhance both optimization and generalization, offering theoretical support for curvature-aware training in deep learning and guidance for choosing damping and batch settings.

Abstract

An important question in deep learning is how higher-order optimization methods affect generalization. In this work, we analyze a stochastic Gauss-Newton (SGN) method with Levenberg-Marquardt damping and mini-batch sampling for training overparameterized deep neural networks with smooth activations in a regression setting. Our theoretical contributions are twofold. First, we establish finite-time convergence bounds via a variable-metric analysis in parameter space, with explicit dependencies on the batch size, network width and depth. Second, we derive non-asymptotic generalization bounds for SGN using uniform stability in the overparameterized regime, characterizing the impact of curvature, batch size, and overparameterization on generalization performance. Our theoretical results identify a favorable generalization regime for SGN in which a larger minimum eigenvalue of the Gauss-Newton matrix along the optimization path yields tighter stability bounds.
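To make the algorithm concrete, the following is a minimal sketch of a mini-batch Gauss-Newton step with Levenberg-Marquardt damping, written with NumPy. Everything here is an illustrative assumption rather than the paper's exact method: the two-layer tanh network with $1/\sqrt{m}$ output scaling, the function names (`predict`, `jacobian`, `sgn_direction`, `sgn_step`), the step size `lr`, and the textbook preconditioner $\frac{1}{B}J^\top J+\lambda I$ built from the current batch only. The paper's preconditioner and step-size schedule may differ.

```python
import numpy as np

def unpack(theta, m, d):
    """Split the flat parameter vector into hidden weights W (m x d) and output weights c (m,)."""
    return theta[: m * d].reshape(m, d), theta[m * d:]

def predict(theta, X, m):
    """Hypothetical two-layer network with a smooth activation: f(x) = c^T tanh(W x) / sqrt(m)."""
    W, c = unpack(theta, m, X.shape[1])
    return np.tanh(X @ W.T) @ c / np.sqrt(m)

def jacobian(theta, X, m):
    """Analytic Jacobian of the predictions w.r.t. theta, shape (n, m*d + m)."""
    n, d = X.shape
    W, c = unpack(theta, m, d)
    A = np.tanh(X @ W.T)                                  # hidden activations, (n, m)
    dW = ((1.0 - A**2) * c)[:, :, None] * X[:, None, :]   # derivative w.r.t. W, (n, m, d)
    return np.concatenate([dW.reshape(n, m * d), A], axis=1) / np.sqrt(m)

def sgn_direction(theta, X, y, m, lam):
    """Damped Gauss-Newton direction on one mini-batch (assumed textbook form)."""
    J = jacobian(theta, X, m)
    r = predict(theta, X, m) - y                  # residuals of the quadratic loss
    B = len(y)
    g = J.T @ r / B                               # gradient of the batch loss 0.5 * mean(r**2)
    G = J.T @ J / B + lam * np.eye(theta.size)    # Gauss-Newton matrix plus LM damping
    return np.linalg.solve(G, g), g

def sgn_step(theta, X, y, m, lam, lr):
    """One stochastic Gauss-Newton update with damping lam and step size lr."""
    direction, _ = sgn_direction(theta, X, y, m, lam)
    return theta - lr * direction

# Toy regression run with synthetic data (all values are illustrative).
rng = np.random.default_rng(0)
n, d, m = 64, 3, 16
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0])
theta = rng.normal(size=m * d + m)
for k in range(50):
    idx = rng.choice(n, size=8, replace=False)    # mini-batch sampling
    theta = sgn_step(theta, X[idx], y[idx], m, lam=0.1, lr=0.5)
```

Because the damped matrix is positive definite for any $\lambda>0$, the computed direction is always a descent direction for the batch loss, even when $J^\top J$ is rank-deficient, as it is whenever the number of parameters exceeds the batch size. The batch size of 8 and $\lambda=0.1$ are arbitrary choices for the toy run.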

Paper Structure

This paper contains 35 sections, 16 theorems, 178 equations.

Key Result

Lemma 1

For any compact and convex $\mathcal{C}\subset \mathbb{R}^p$ with let $\kappa_0:=\max_{h\in[H]}\frac{\|W_0^{(h)}\|_2}{\sqrt{m}}$ and $\zeta_0:=\|c_0\|$. Also, let $\kappa_\mathcal{C} := \kappa_0+\frac{r_{\mathcal{C}}}{\sqrt{m}}$ and $\zeta_\mathcal{C}:=\zeta_0+\frac{r_{\mathcal{C}}}{\sqrt{m}}$. We have the following (local) Lipschitz continuity results in …
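The constants in the lemma statement can be evaluated directly from an initialization. The sketch below makes several assumptions: `weights` is a list holding the matrices $W_0^{(h)}$, $\|\cdot\|_2$ is read as the spectral norm for matrices and the Euclidean norm for $c_0$, and `r_C` is supplied by the caller because its definition is not visible in the excerpt above. The function names are hypothetical.

```python
import numpy as np

def init_constants(weights, c0, m):
    """kappa_0 = max_h ||W_0^(h)||_2 / sqrt(m) and zeta_0 = ||c_0||."""
    kappa0 = max(np.linalg.norm(W, 2) for W in weights) / np.sqrt(m)  # spectral norm per layer
    zeta0 = np.linalg.norm(c0)                                        # Euclidean norm (assumed)
    return kappa0, zeta0

def local_constants(kappa0, zeta0, r_C, m):
    """kappa_C = kappa_0 + r_C / sqrt(m) and zeta_C = zeta_0 + r_C / sqrt(m)."""
    shift = r_C / np.sqrt(m)
    return kappa0 + shift, zeta0 + shift

# Illustrative values only: two identity layers of width m = 4 and c_0 = (3, 4).
m = 4
kappa0, zeta0 = init_constants([np.eye(m), np.eye(m)], np.array([3.0, 4.0]), m)
kappa_C, zeta_C = local_constants(kappa0, zeta0, r_C=2.0, m=m)
```

Both shifted constants exceed their initial values by $r_{\mathcal{C}}/\sqrt{m}$, so for a fixed $r_{\mathcal{C}}$ they approach $\kappa_0$ and $\zeta_0$ as the width $m$ grows.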

Theorems & Definitions (35)

  • Lemma 1
  • Remark 1
  • Remark 2: Beyond quadratic loss
  • Theorem 1: Finite-Time Bounds for SGN
  • Remark 3
  • Proposition 1
  • Remark 4
  • Corollary 1: Near-optimality in $\mathcal{F}_{\textsc{ntk}}$
  • Lemma 2: Stability with midpoint metric
  • Theorem 2: Uniform Stability of SGN
  • ...and 25 more