An Adaptive Stochastic Gradient Method with Non-negative Gauss-Newton Stepsizes
Antonio Orvieto, Lin Xiao
TL;DR
The paper introduces NGN, an adaptive stochastic gradient method with non-negative Gauss-Newton stepsizes for minimizing $f(x)=\frac{1}{N}\sum_i f_i(x)$, where each $f_i$ is non-negative and smooth. Writing the loss as $f(x)=r^2(x)$ with $r(x)=\sqrt{f(x)}$ and applying a regularized Gauss-Newton step to $r$ gives the update $x^{k+1}=x^k-\gamma_k\nabla f(x^k)$ with $\gamma_k=\frac{\sigma}{1+\frac{\sigma}{2f(x^k)}\|\nabla f(x^k)\|^2}$; the stochastic variant replaces $f$ with the sampled loss $f_{i_k}$. Since $\gamma_k\le\sigma$ and $\gamma_k$ shrinks when the squared gradient norm is large relative to the loss, the effective stepsize warms up and decays automatically, and in the convex case no knowledge of the gradient Lipschitz constant $L$ is needed. The paper develops a stochastic convergence theory for the convex, strongly convex, and non-convex settings, including a novel decomposition of the stepsize to control the correlation between the stochastic stepsize and the stochastic gradient. It shows that NGN never diverges and converges to a neighborhood of the solution whose size vanishes as $\sigma\to0$; annealing $\sigma$ yields asymptotic convergence, again without exact knowledge of $L$. Empirical results on convex classification tasks and deep learning models show NGN outperforming SGD, SPS, and Adagrad across a range of hyperparameters and settings, with favorable stability and lower memory usage compared to Adam. The generalized Gauss-Newton perspective ties NGN to second-order ideas and positions it as a robust, curvature-aware alternative for large-scale stochastic optimization.
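To make the stepsize formula concrete, the following is a minimal NumPy sketch of one deterministic NGN step, not the authors' implementation. The function names `ngn_stepsize` and `ngn_step` and the toy quadratic are illustrative assumptions. Returning a zero stepsize when the loss is exactly zero is also an assumption, chosen because $\gamma_k\to0$ as $f(x^k)\to0$ for a fixed gradient. For the stochastic variant, one would pass the minibatch loss and its gradient instead of the full ones.

```python
import numpy as np


def ngn_stepsize(loss, grad, sigma):
    """NGN stepsize: sigma / (1 + sigma * ||grad||^2 / (2 * loss)).

    Assumes loss >= 0. The value returned at loss == 0 is an assumption
    of this sketch (it matches the limit of the formula as loss -> 0).
    """
    if loss <= 0.0:
        return 0.0
    grad_sq_norm = float(np.dot(grad, grad))
    return sigma / (1.0 + sigma * grad_sq_norm / (2.0 * loss))


def ngn_step(x, loss_fn, grad_fn, sigma):
    """One step x - gamma * grad, with gamma given by ngn_stepsize."""
    loss = float(loss_fn(x))
    grad = grad_fn(x)
    gamma = ngn_stepsize(loss, grad, sigma)
    return x - gamma * grad, gamma


if __name__ == "__main__":
    # Toy non-negative loss f(x) = 0.5 * ||x||^2, whose gradient is x (L = 1).
    loss_fn = lambda x: 0.5 * float(np.dot(x, x))
    grad_fn = lambda x: x

    x = np.array([3.0, -4.0])
    sigma = 100.0  # plain gradient descent with this stepsize would diverge
    for k in range(5):
        x, gamma = ngn_step(x, loss_fn, grad_fn, sigma)
        print(k, gamma, loss_fn(x))
```

On this quadratic, $\|\nabla f(x)\|^2 = 2f(x)$, so the stepsize simplifies to $\sigma/(1+\sigma)$, which stays below $1$ however large $\sigma$ is. This illustrates, on one example only, why an overly large $\sigma$ need not cause divergence.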
Abstract
We consider the problem of minimizing the average of a large number of smooth but possibly non-convex functions. In the context of most machine learning applications, each loss function is non-negative and thus can be expressed as the composition of a square and its real-valued square root. This reformulation allows us to apply the Gauss-Newton method, or the Levenberg-Marquardt method when adding a quadratic regularization. The resulting algorithm, while being computationally as efficient as the vanilla stochastic gradient method, is highly adaptive and can automatically warm up and decay the effective stepsize while tracking the non-negative loss landscape. We provide a tight convergence analysis, leveraging new techniques, in the stochastic convex and non-convex settings. In particular, in the convex case, the method does not require access to the gradient Lipschitz constant for convergence, and is guaranteed to never diverge. The convergence rates and empirical evaluations compare favorably to the classical (stochastic) gradient method as well as to several other adaptive methods.
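The regularized Gauss-Newton (Levenberg-Marquardt) step described in the abstract can be worked out in a few lines. The derivation below is a sketch, assuming the quadratic regularization weight is written as $\frac{1}{2\sigma}$, a normalization chosen here so that the result matches the stepsize $\gamma_k$ quoted in the TL;DR. The symbols $p$ and $a$ are auxiliary notation introduced only for this sketch.

```latex
\begin{align*}
% Linearize r = \sqrt{f} around x^k and add a quadratic penalty on the step p:
p^k &= \arg\min_{p}\ \bigl(r(x^k) + \nabla r(x^k)^\top p\bigr)^2 + \frac{1}{2\sigma}\|p\|^2 .\\
% First-order optimality condition:
0 &= 2\bigl(r(x^k) + \nabla r(x^k)^\top p\bigr)\nabla r(x^k) + \frac{1}{\sigma}\,p .\\
% Take the inner product with \nabla r(x^k) and solve for a := \nabla r(x^k)^\top p:
r(x^k) + a &= \frac{r(x^k)}{1 + 2\sigma\|\nabla r(x^k)\|^2}
\quad\Longrightarrow\quad
p^k = -\frac{2\sigma\, r(x^k)\,\nabla r(x^k)}{1 + 2\sigma\|\nabla r(x^k)\|^2} .\\
% Substitute \nabla f = 2 r \nabla r and \|\nabla r\|^2 = \|\nabla f\|^2 / (4 f):
p^k &= -\frac{\sigma}{1 + \frac{\sigma}{2 f(x^k)}\|\nabla f(x^k)\|^2}\,\nabla f(x^k)
     = -\gamma_k \nabla f(x^k) .
\end{align*}
```

The step is therefore an ordinary gradient step whose stepsize $\gamma_k$ is computed from the loss value and the gradient norm alone, which is consistent with the abstract's claim that the method costs about as much per iteration as the vanilla stochastic gradient method.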
