Fast Unconstrained Optimization via Hessian Averaging and Adaptive Gradient Sampling Methods

Thomas O'Leary-Roseberry, Raghu Bollapragada

TL;DR

This work develops Hessian-averaged Newton methods that tolerate gradient inexactness through adaptive gradient sampling while keeping a fixed per-iteration Hessian cost. It introduces deterministic and stochastic variants and proves global linear and sublinear convergence for strongly convex and nonconvex functions respectively, a local superlinear rate of $\mathcal{O}\left(\frac{1}{k}\right)$ for deterministic cyclic Hessian sampling, and a local superlinear rate of $\mathcal{O}\left(\frac{1}{\sqrt{k}}\right)$ in expectation in the stochastic setting. A practical diagonally-averaged Newton (Dan) variant is proposed to enable scalable, matrix-free implementations via Hessian-vector products and diagonal estimators, with Dan2 providing an alternative diagonal-approximation scheme. Numerical experiments on stochastic quadratic problems, logistic regression, CIFAR-10/100 with ResNets, and derivative-informed neural operators for parametric PDEs indicate that Hessian averaging improves stability and often achieves state-of-the-art or competitive performance, particularly when combined with adaptive gradient sampling. Overall, the framework offers theoretically grounded, scalable second-order optimization tools that are competitive with first-order methods in practice, while delivering faster local convergence and robust performance on challenging ML and scientific computing tasks.
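To make the phrase "matrix-free implementations via Hessian-vector products and diagonal estimators" concrete, the following is a minimal sketch of a Hutchinson-type diagonal estimator combined with a uniform running average across iterations. It is not taken from the paper and should not be read as the Dan algorithm itself: the Rademacher probes, the uniform averaging weights, and the names `hutchinson_diag` and `update_running_average` are all assumptions made for illustration. The only access to the Hessian is through a user-supplied `hvp` callable, so the matrix is never formed.

```python
import numpy as np

def hutchinson_diag(hvp, dim, num_probes, rng):
    """Estimate diag(H) using only Hessian-vector products.

    hvp is any callable v -> H @ v (hypothetical interface).
    For Rademacher probes z, E[z * (H z)] equals diag(H).
    """
    est = np.zeros(dim)
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=dim)  # Rademacher probe vector
        est += z * hvp(z)                      # elementwise product
    return est / num_probes

def update_running_average(avg, new, k):
    """Uniform average of k + 1 estimates, given the average of the first k."""
    return (k * avg + new) / (k + 1)

def diagonal_newton_step(w, grad, diag_avg, step=1.0, floor=1e-8):
    """Scale the gradient by the inverse of the averaged diagonal.

    The absolute value and the floor guard against small or negative
    entries; this safeguard is an assumption of the sketch.
    """
    return w - step * grad / np.maximum(np.abs(diag_avg), floor)
```

In an autodiff framework the `hvp` callable would typically come from a forward-over-reverse or reverse-over-reverse product, which costs a small constant multiple of a gradient evaluation. Averaging the per-iteration estimates reduces the variance of the diagonal without increasing the per-iteration cost.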

Abstract

We consider minimizing finite-sum and expectation objective functions via Hessian-averaging-based subsampled Newton methods. These methods allow for gradient inexactness and have fixed per-iteration Hessian approximation costs. The recent work (Na et al. 2023) demonstrated that Hessian averaging can be utilized to achieve fast $\mathcal{O}\left(\sqrt{\tfrac{\log k}{k}}\right)$ local superlinear convergence for strongly convex functions in high probability, while maintaining fixed per-iteration Hessian costs. These methods, however, require gradient exactness and strong convexity, which poses challenges for their practical implementation. To address this concern we consider Hessian-averaged methods that allow gradient inexactness via norm-condition-based adaptive-sampling strategies. For the finite-sum problem we utilize deterministic sampling techniques, which lead to global linear and sublinear convergence rates for strongly convex and nonconvex functions respectively. In this setting we are able to derive an improved deterministic local superlinear convergence rate of $\mathcal{O}\left(\tfrac{1}{k}\right)$. For the expectation problem we utilize stochastic sampling techniques, and derive global linear and sublinear rates for strongly convex and nonconvex functions, as well as an $\mathcal{O}\left(\tfrac{1}{\sqrt{k}}\right)$ local superlinear convergence rate, all in expectation. We present novel analysis techniques that differ from the previous probabilistic results. Additionally, we propose scalable and efficient variations of these methods via diagonal approximations and derive the novel diagonally-averaged Newton (Dan) method for large-scale problems. Our numerical results demonstrate that Hessian averaging not only helps with convergence, but can lead to state-of-the-art performance on difficult problems such as CIFAR100 classification with ResNets.
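The combination described in the abstract (a fixed-size Hessian batch averaged across iterations, plus a gradient batch that grows only when a norm condition fails) can be sketched on a small regularized logistic regression problem. This is an illustrative sketch under stated assumptions, not the authors' algorithm: the uniform averaging weights, the unit step size, the batch-doubling rule, the sample-variance surrogate for the norm condition, and every function name (`make_toy_data`, `per_sample_grads`, `subsampled_hessian`, `hessian_averaged_newton`) are hypothetical choices made here.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def make_toy_data(n, d, seed):
    """Synthetic, non-separable binary labels in {-1, +1} (assumed test problem)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    y = np.where(X @ np.ones(d) + rng.standard_normal(n) >= 0, 1.0, -1.0)
    return X, y

def per_sample_grads(w, X, y, mu):
    """Row i is the gradient of log(1 + exp(-y_i x_i^T w)) + (mu/2)||w||^2."""
    coeff = -y * sigmoid(-y * (X @ w))
    return coeff[:, None] * X + mu * w

def subsampled_hessian(w, X, y, mu):
    """Mean Hessian of the same terms over the rows of X."""
    s = sigmoid(y * (X @ w))
    return (X.T * (s * (1.0 - s))) @ X / X.shape[0] + mu * np.eye(X.shape[1])

def hessian_averaged_newton(X, y, mu=0.1, theta=0.5, hess_batch=64,
                            grad_batch0=8, iters=60, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, H_avg, grad_batch = np.zeros(d), np.zeros((d, d)), grad_batch0
    for k in range(iters):
        # Fixed-cost Hessian batch, drawn by cycling through the data.
        idx_h = (k * hess_batch + np.arange(hess_batch)) % n
        H_k = subsampled_hessian(w, X[idx_h], y[idx_h], mu)
        H_avg = (k * H_avg + H_k) / (k + 1)  # uniform average (assumption)
        # Grow the gradient batch until the surrogate norm test passes:
        #   (sample variance) / |S| <= theta^2 * ||g_S||^2
        while True:
            idx_g = rng.choice(n, size=grad_batch, replace=False)
            G = per_sample_grads(w, X[idx_g], y[idx_g], mu)
            g = G.mean(axis=0)
            if grad_batch == n:
                break  # full gradient, nothing left to sample
            if G.var(axis=0, ddof=1).sum() / grad_batch <= theta**2 * (g @ g):
                break
            grad_batch = min(2 * grad_batch, n)
        w = w - np.linalg.solve(H_avg, g)  # unit step (assumption)
    return w, grad_batch
```

The Hessian work per iteration stays at `hess_batch` samples regardless of `k`, while the quality of `H_avg` improves as more batches are averaged in. The gradient batch, by contrast, is allowed to grow, because the norm test becomes harder to satisfy as the gradient shrinks near a minimizer.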

Paper Structure

This paper contains 50 sections, 23 theorems, 148 equations, 5 figures, 7 tables.

Key Result

Lemma 2.1

Suppose Assumption assum:bnd_var holds. For any $k \in \mathbb{Z}^+$, let $\lambda_{\min}(A_k), \lambda_{\max}(A_k) \in (0,\infty)$ denote the smallest and largest eigenvalues of $A_k$, respectively. Then we have that

Figures (5)

  • Figure 1: Overview of the results presented in this section. We characterize the main global and local convergence results and their relationship to the problem constants and algorithmic hyperparameters. Here, $N$ denotes the number of data points, $\mu$ is the Hessian spectral lower bound, $\kappa = \frac{L}{\mu}$ is a condition-number-like constant, and $M$ is the Hessian Lipschitz constant.
  • Figure 2: Overview of the results presented in this section. We characterize the main global and local convergence results and their relationship to the problem constants and algorithmic hyperparameters. Here, $\mu$ is the Hessian spectral lower bound, $\kappa = \frac{L}{\mu}$ is a condition-number-like constant, and $M$ is the Hessian Lipschitz constant. We note that there are additional global sublinear and linear results that allow any $0<\tilde{\theta}_g<\infty$, but we do not annotate them in this figure for simplicity.
  • Figure 3: Efficiently implemented (vectorized) Hessian subspace products for varying ranks and sample sizes have approximately constant compute time until running out of GPU memory. These experiments are for a ResNet used in CIFAR100 classification, as discussed in the CIFAR results subsection. The dimension of the neural network weights $w$ was $d = 11{,}247{,}052$. This set of experiments was run on an NVIDIA L40S GPU, which has 48 GB of GPU RAM.
  • Figure 4: The performance of the best four methods for the stochastic quadratic minimization problem. Dan with adaptive gradient sampling performed the best, both in terms of fast convergence and $f(w^\dagger)$. SGD with adaptive gradient sampling also performed well, but required significant limitations on the step size. The methods without adaptive gradient sampling plateaued with larger $f(w^\dagger)$. Hessian averaging can eventually overcome stability issues once enough iterations have passed to reduce the Hessian variance, as seen for Dan exp $\alpha = 1.0$. When not using Hessian averaging, Newton methods required smaller steps to maintain stability.
  • Figure 5: Comparison of the best runs for Adahessian, Adam, Dan and Dan2 with regard to epoch-equivalent work. The best Adahessian run happened to use a lower Hessian rank in the approximation than the best Dan and Dan2 runs for CIFAR10, while for CIFAR100 the best individual runs shown all used rank 1. The CIFAR10 performance is quite similar for all four methods, while for CIFAR100 the Hessian-averaged methods substantially outperformed Adam; in particular, Dan gave the best generalization accuracy (73.82% for the run shown).

Theorems & Definitions (63)

  • Remark 2.1
  • Remark 2.2
  • Lemma 2.1
  • proof
  • Lemma 3.1
  • proof
  • Theorem 3.2
  • proof
  • Remark 3.1
  • Theorem 3.3
  • ...and 53 more