Non-Asymptotic Optimization and Generalization Bounds for Stochastic Gauss-Newton in Overparameterized Models
Semih Cayci
TL;DR
The paper investigates stochastic Gauss-Newton (SGN) with Levenberg-Marquardt damping for training overparameterized deep neural networks in regression, deriving non-asymptotic convergence and generalization bounds. Through a variable-metric Lyapunov analysis, it provides a finite-time convergence rate of $\mathcal{O}\left(\frac{1}{k}\left[\bar{r}_k\log k+\lambda+\lambda^{-1}\right]+\frac{1}{\sqrt{m}}\right)$ and algorithm-dependent generalization bounds via uniform stability that account for curvature, batch size, and overparameterization; the damping factor $\lambda$ mediates a trade-off between optimization and generalization. In the neural tangent kernel (NTK) regime, SGN achieves global near-optimality on the training data within a rich NTK function class, with bounds scaling as $\mathcal{O}\left(\frac{\bar{r}_k\log k}{k}+\frac{1}{\sqrt{m}}\right)$; these bounds improve with overparameterization. The results demonstrate robustness to ill-conditioning (strict conditioning of the NTK is not required) and show that cumulative, well-conditioned preconditioning can enhance both optimization and generalization, offering theoretical support for curvature-aware training in deep learning and guidance for choosing the damping factor and batch size.
Abstract
An important question in deep learning is how higher-order optimization methods affect generalization. In this work, we analyze a stochastic Gauss-Newton (SGN) method with Levenberg-Marquardt damping and mini-batch sampling for training overparameterized deep neural networks with smooth activations in a regression setting. Our theoretical contributions are twofold. First, we establish finite-time convergence bounds via a variable-metric analysis in parameter space, with explicit dependencies on the batch size, network width and depth. Second, we derive non-asymptotic generalization bounds for SGN using uniform stability in the overparameterized regime, characterizing the impact of curvature, batch size, and overparameterization on generalization performance. Our theoretical results identify a favorable generalization regime for SGN in which a larger minimum eigenvalue of the Gauss-Newton matrix along the optimization path yields tighter stability bounds.
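To make the setting concrete, the following is a minimal sketch of a mini-batch Gauss-Newton step with Levenberg-Marquardt damping for squared-loss regression. It is an illustration under stated assumptions, not the paper's exact algorithm: the two-layer tanh network with fixed output weights, the per-batch (non-cumulative) preconditioner, the step size, and all function names (`predict`, `jacobian`, `sgn_step`, `train`, `mse`) are hypothetical choices made here. The update is the textbook damped form $\theta \leftarrow \theta - \eta\,(J^\top J + \lambda I)^{-1} J^\top (f_\theta(X_B) - y_B)$, where $J$ is the Jacobian of the batch predictions.

```python
import numpy as np

def predict(W, c, X):
    """Assumed model: f(x) = c^T tanh(W x) / sqrt(m), with fixed signs c."""
    m = W.shape[0]
    return np.tanh(X @ W.T) @ c / np.sqrt(m)

def jacobian(W, c, X):
    """Jacobian of the predictions w.r.t. W, flattened to (batch, m * d)."""
    m = W.shape[0]
    act_grad = 1.0 - np.tanh(X @ W.T) ** 2            # tanh'(W x), shape (B, m)
    J = (act_grad * c)[:, :, None] * X[:, None, :]    # shape (B, m, d)
    return J.reshape(X.shape[0], -1) / np.sqrt(m)

def sgn_step(W, c, X, y, lam, step_size=0.5):
    """One damped Gauss-Newton step on a mini-batch (X, y)."""
    residual = predict(W, c, X) - y                   # shape (B,)
    J = jacobian(W, c, X)                             # shape (B, p)
    # Push-through identity:
    #   (J^T J + lam I)^{-1} J^T r = J^T (J J^T + lam I)^{-1} r,
    # so only a B x B system is solved instead of a p x p one.
    K = J @ J.T + lam * np.eye(J.shape[0])
    direction = J.T @ np.linalg.solve(K, residual)    # shape (p,)
    return W - step_size * direction.reshape(W.shape)

def mse(W, c, X, y):
    """Mean squared error of the network on (X, y)."""
    return float(np.mean((predict(W, c, X) - y) ** 2))

def train(X, y, m=256, lam=0.1, batch_size=16, num_steps=200,
          step_size=0.5, seed=0):
    """Run SGN with uniformly sampled mini-batches from a random init."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, X.shape[1]))          # hidden-layer weights
    c = rng.choice([-1.0, 1.0], size=m)               # fixed output weights
    for _ in range(num_steps):
        idx = rng.choice(X.shape[0], size=batch_size, replace=False)
        W = sgn_step(W, c, X[idx], y[idx], lam, step_size)
    return W, c
```

In this sketch, a larger `lam` shrinks each step toward a scaled gradient step, while a smaller `lam` moves it closer to an undamped Gauss-Newton step on the batch; this is the knob whose trade-off between optimization and generalization the analysis characterizes.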
