An Adaptive Stochastic Gradient Method with Non-negative Gauss-Newton Stepsizes

Antonio Orvieto, Lin Xiao

TL;DR

The paper introduces NGN, an adaptive stochastic gradient method with non-negative Gauss-Newton stepsizes for minimizing $f(x)=\frac{1}{N}\sum_i f_i(x)$, where each $f_i\ge0$ is smooth. Writing the loss as $f(x)=r^2(x)$ with $r(x)=\sqrt{f(x)}$, NGN takes gradient steps with stepsize $\gamma_k=\frac{\sigma}{1+\frac{\sigma}{2f(x^k)}\|\nabla f(x^k)\|^2}$ (and a stochastic variant), which warms up and decays automatically and, in the convex case, requires no knowledge of the gradient Lipschitz constant $L$. The paper develops a stochastic convergence theory for convex, strongly convex, and non-convex settings, including a novel decomposition of the stepsize to control stochastic correlations, and shows that NGN converges to a neighborhood whose size vanishes as $\sigma\to0$, while guaranteeing non-divergence. Empirical results on convex classification tasks and deep learning models show NGN outperforming SGD, SPS, and Adagrad across a range of hyperparameters and settings, with favorable stability and lower memory usage compared to Adam. The generalized Gauss-Newton perspective ties NGN to second-order ideas, and annealing $\sigma$ yields asymptotic convergence without exact knowledge of $L$, which positions NGN as a robust, curvature-aware alternative for large-scale stochastic optimization.
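The stepsize formula above can be illustrated with a minimal Python sketch. The function names `ngn_stepsize` and `ngn_step`, the toy objective $\|x\|^2$, and the convention of returning $\sigma$ when both the loss and the gradient are zero are assumptions made for illustration, not the authors' code. The sketch covers only the deterministic step; a stochastic variant would presumably use a sampled $f_i$ and its gradient in place of $f$.

```python
from typing import Callable, List, Sequence


def ngn_stepsize(f_val: float, grad: Sequence[float], sigma: float) -> float:
    """NGN stepsize: sigma / (1 + sigma * ||grad||^2 / (2 * f_val)).

    Evaluated in the algebraically equivalent form
    2 * sigma * f / (2 * f + sigma * ||grad||^2), which avoids dividing by f.
    """
    if f_val < 0.0 or sigma <= 0.0:
        raise ValueError("requires f_val >= 0 and sigma > 0")
    grad_sq = sum(g * g for g in grad)
    denom = 2.0 * f_val + sigma * grad_sq
    if denom == 0.0:
        # f = 0 and grad = 0: the step is zero whatever gamma is.
        # Returning sigma here is a convention chosen for this sketch.
        return sigma
    return 2.0 * sigma * f_val / denom


def ngn_step(
    x: Sequence[float],
    f: Callable[[Sequence[float]], float],
    grad_f: Callable[[Sequence[float]], List[float]],
    sigma: float,
) -> List[float]:
    """One deterministic step x - gamma * grad f(x) with the NGN stepsize."""
    g = grad_f(x)
    gamma = ngn_stepsize(f(x), g, sigma)
    return [xi - gamma * gi for xi, gi in zip(x, g)]


# Hypothetical toy objective f(x) = ||x||^2 (non-negative and smooth).
def sq_norm(x: Sequence[float]) -> float:
    return sum(xi * xi for xi in x)


def sq_norm_grad(x: Sequence[float]) -> List[float]:
    return [2.0 * xi for xi in x]


x0 = [1.0, 0.0]
gamma0 = ngn_stepsize(sq_norm(x0), sq_norm_grad(x0), sigma=1.0)
x1 = ngn_step(x0, sq_norm, sq_norm_grad, sigma=1.0)
```

At `x0` the loss is 1 and the squared gradient norm is 4, so the stepsize is $1/(1+4/2)=1/3$, smaller than $\sigma=1$. The stepsize never exceeds $\sigma$ and shrinks where the gradient is large relative to the loss.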

Abstract

We consider the problem of minimizing the average of a large number of smooth but possibly non-convex functions. In the context of most machine learning applications, each loss function is non-negative and thus can be expressed as the composition of a square and its real-valued square root. This reformulation allows us to apply the Gauss-Newton method, or the Levenberg-Marquardt method when adding a quadratic regularization. The resulting algorithm, while being computationally as efficient as the vanilla stochastic gradient method, is highly adaptive and can automatically warm up and decay the effective stepsize while tracking the non-negative loss landscape. We provide a tight convergence analysis, leveraging new techniques, in the stochastic convex and non-convex settings. In particular, in the convex case, the method does not require access to the gradient Lipschitz constant for convergence, and is guaranteed to never diverge. The convergence rates and empirical evaluations compare favorably to the classical (stochastic) gradient method as well as to several other adaptive methods.
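The square-root reformulation can be made concrete with a short derivation. The sketch below assumes $f(x)>0$, so that $r=\sqrt{f}$ is differentiable at $x$, and a regularization weight of $\frac{1}{2\sigma}$ on $\|p\|^2$. That weight is an assumption, chosen here because it reproduces the stepsize quoted in the TL;DR.

```latex
\begin{aligned}
&\text{Linearize } r \text{ inside the square:}\quad
f(x+p) = r^2(x+p) \approx \bigl(r(x) + \nabla r(x)^\top p\bigr)^2,
\qquad \nabla r(x) = \frac{\nabla f(x)}{2\sqrt{f(x)}}. \\[4pt]
&\text{Regularized step:}\quad
p^\star = \arg\min_{p}\ \bigl(r(x) + \nabla r(x)^\top p\bigr)^2 + \frac{1}{2\sigma}\|p\|^2. \\[4pt]
&\text{Stationarity:}\quad
2\bigl(r(x) + \nabla r(x)^\top p\bigr)\nabla r(x) + \frac{1}{\sigma}\,p = 0
\ \Longrightarrow\
p^\star = -\frac{2\sigma\, r(x)}{1 + 2\sigma\|\nabla r(x)\|^2}\,\nabla r(x). \\[4pt]
&\text{Substitute } \nabla r \text{ and } \|\nabla r(x)\|^2 = \frac{\|\nabla f(x)\|^2}{4 f(x)}:\quad
p^\star = -\frac{\sigma}{1 + \frac{\sigma}{2 f(x)}\|\nabla f(x)\|^2}\,\nabla f(x).
\end{aligned}
```

The resulting step is a gradient step whose stepsize depends only on the loss value and the gradient norm, which is why it costs no more per iteration than a plain gradient step.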

Paper Structure (28 sections, 11 theorems, 129 equations, 7 figures, 3 tables)

Key Result

Lemma 2.1

Suppose $f:\mathbf{R}^d\to\mathbf{R}$ is non-negative, differentiable and $L$-smooth. Then the NGN-det stepsize given in Eq. \ref{eq:NGN-det-gamma} satisfies
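The displayed bound did not survive in this summary. As a reconstruction rather than a quotation of the lemma, the following standard argument gives a bound that is consistent with the stepsize formula in the TL;DR and with the lemma's title ("Stepsize bounds"). It assumes $f(x^k)>0$; when $f(x^k)=0$ the gradient also vanishes and the step is zero.

```latex
\begin{aligned}
&\text{$L$-smoothness and } f \ge 0 \text{ give, for every } x:\quad
0 \;\le\; f\!\Bigl(x - \tfrac{1}{L}\nabla f(x)\Bigr) \;\le\; f(x) - \frac{1}{2L}\|\nabla f(x)\|^2
\ \Longrightarrow\
\frac{\|\nabla f(x)\|^2}{2 f(x)} \;\le\; L. \\[4pt]
&\text{Plugging into } \gamma_k = \frac{\sigma}{1 + \frac{\sigma}{2 f(x^k)}\|\nabla f(x^k)\|^2}:\quad
\frac{\sigma}{1 + \sigma L} \;\le\; \gamma_k \;\le\; \sigma .
\end{aligned}
```

Under this reading, the stepsize never exceeds the hyperparameter $\sigma$ and never falls below $\frac{1}{L+\sigma^{-1}}$, a value strictly less than $1/L$.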

Figures (7)

  • Figure 1: NGN and GD updates and corresponding objective function estimate in Eqs. \ref{eq:NGN_estimate} and \ref{eq:GD_estimate} on a few toy examples (inspired by chen2011hessian). The black dot denotes the initial $x$, and the star is the position after one step: $x+p$. Compared to GD with stepsize $\gamma=\sigma$, NGN is more conservative if the landscape is sharp. Note that the function approximation provided by NGN is always non-negative, as is clear from our motivation and the algorithm derivation.
  • Figure 2: Optimization dynamics of constant-stepsize GD, NGN, and APS$_{\max}$ on the toy example $f(x) = \frac{\lambda}{2} (x-x^*)^2+f^*$, for different hyperparameter values. NGN is stable for any finite $\sigma>0$. Dashed line in the bottom row is the value $2/\lambda$.
  • Figure 3: Deterministic NGN compared with constant-stepsize Gradient Descent, Polyak Stepsizes, and Armijo Line search on three classification datasets listed in Table \ref{tab:dataset}. The left column shows the optimality gap $f(x^k)-f^*$, the center column plots the evolution of stepsizes, and the right column shows the projection of the iterate trajectories onto the top-two PCA components. On the trajectories plot, the circle denotes the starting point, the star denotes the solution found at the last iteration, and the square represents the point after one iteration.
  • Figure 4: Comparison of NGN with Adagrad-norm and SGD for two convex problems, under a decreasing stepsize and a constant stepsize. A comment is provided below.
  • Figure 5: (SGD vs. Adam vs. NGN) Experimental results on Deep Neural Networks (stochastic gradients). All details and comments can be found in the text. Performance is shown for three or five hyperparameter values (Table \ref{tb:hyper}); each method is tuned to perform best at hyperparameter #2.
  • ...and 2 more figures
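The toy quadratic from Figure 2, $f(x) = \frac{\lambda}{2}(x-x^*)^2+f^*$, can be simulated in a few lines to see the stability claim. This is a sketch under stated assumptions, not the paper's experiment: the function name `run_quadratic`, the starting point, the step count, and the chosen values of $\sigma$, $\lambda$, and $f^*$ are all hypothetical.

```python
from typing import List


def run_quadratic(
    method: str,
    sigma: float,
    lam: float = 1.0,
    x_star: float = 0.0,
    f_star: float = 0.0,
    x0: float = 5.0,
    steps: int = 50,
) -> List[float]:
    """Iterate GD (stepsize sigma) or NGN on f(x) = lam/2 * (x - x_star)^2 + f_star."""
    x = x0
    traj = [x]
    for _ in range(steps):
        f_val = 0.5 * lam * (x - x_star) ** 2 + f_star
        grad = lam * (x - x_star)
        if method == "gd":
            gamma = sigma
        elif method == "ngn":
            # sigma / (1 + sigma * grad^2 / (2 f)), written without dividing by f.
            denom = 2.0 * f_val + sigma * grad * grad
            gamma = sigma if denom == 0.0 else 2.0 * sigma * f_val / denom
        else:
            raise ValueError("method must be 'gd' or 'ngn'")
        x = x - gamma * grad
        traj.append(x)
    return traj


# sigma = 3 exceeds 2 / lam = 2, the stability threshold of constant-stepsize GD.
gd_traj = run_quadratic("gd", sigma=3.0)
ngn_traj = run_quadratic("ngn", sigma=3.0)

# With f_star > 0 and a large sigma, NGN iterates stay bounded in this example.
ngn_offset_traj = run_quadratic("ngn", sigma=10.0, f_star=1.0)
```

With $f^*=0$ the NGN stepsize on this function equals $\frac{\sigma}{1+\sigma\lambda}$, so the distance to $x^*$ contracts by the factor $\frac{1}{1+\sigma\lambda}$ at every step for any $\sigma>0$, while GD with $\sigma>2/\lambda$ diverges. With $f^*>0$ and large $\sigma$ the iterates in this run remain bounded but do not settle at $x^*$, in line with the TL;DR's statement that convergence is to a neighborhood that shrinks as $\sigma\to0$.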

Theorems & Definitions (22)

  • Lemma 2.1: Stepsize bounds
  • Lemma 4.1: Fundamental Equality
  • proof
  • Lemma 4.2: Fundamental Inequality
  • proof
  • Remark: On the choice of $\delta$ in \ref{eq:delta-choice}
  • Definition 4.3: Interpolation
  • Remark
  • Definition 4.4: Strong Convexity / Convexity
  • Theorem 4.5: NGN, convex
  • ...and 12 more