Curvature-Informed SGD via General Purpose Lie-Group Preconditioners

Omead Pooladzandi, Xi-Lin Li

TL;DR

This work tackles the slow convergence of SGD in large-scale stochastic settings by injecting curvature information through Lie-group preconditioners that are fitted online. It introduces two general-purpose families, the sparse matrix-free preconditioner (XMat) and the low-rank approximation preconditioner (LRA), both constrained to connected Lie groups, which makes the updates equivariant and removes the need for damping. The authors give theoretical guarantees that the fitted preconditioner converges toward the inverse Hessian (Proposition 3.1 states that $Q$ converges to $\pm |H|^{-0.5}$) and report strong empirical results on vision, NLP, and RL tasks, with modest computational overhead and little sensitivity to hyper-parameters. Overall, curvature-informed PSGD is presented as a practical, scalable optimizer that the authors report yields flatter solutions and better generalization across diverse deep learning tasks, with open-source code for reproducibility.

Abstract

We present a novel approach to accelerate stochastic gradient descent (SGD) by utilizing curvature information obtained from Hessian-vector products or finite differences of parameters and gradients, similar to the BFGS algorithm. Our approach involves two preconditioners: a matrix-free preconditioner and a low-rank approximation preconditioner. We update both preconditioners online using a criterion that is robust to stochastic gradient noise and does not require line search or damping. To preserve the corresponding symmetry or invariance, our preconditioners are constrained to certain connected Lie groups. The Lie group's equivariance property simplifies the preconditioner fitting process, while its invariance property eliminates the need for damping, which is commonly required in second-order optimizers. As a result, the learning rate for parameter updating and the step size for preconditioner fitting are naturally normalized, and their default values work well in most scenarios. Our proposed approach offers a promising direction for improving the convergence of SGD with low computational overhead. We demonstrate that Preconditioned SGD (PSGD) outperforms SoTA on Vision, NLP, and RL tasks across multiple modern deep-learning architectures. We have provided code for reproducing the toy and large-scale experiments in this paper.
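
The abstract mentions curvature obtained from "finite differences of parameters and gradients". A minimal sketch of that idea follows; it is an illustration, not the authors' code. It perturbs the parameters by a small step, differences the gradients, and compares the result with the exact Hessian-vector product. The test function (the standard Rosenbrock function with constants 1 and 100), the point `theta`, the direction `v`, the step `EPS`, and all function names are assumptions chosen for the example.

```python
EPS = 1e-6  # finite-difference step (an assumed value for this sketch)


def rosenbrock_grad(x, y):
    """Gradient of f(x, y) = (1 - x)^2 + 100 (y - x^2)^2."""
    return (-2.0 * (1.0 - x) - 400.0 * x * (y - x * x),
            200.0 * (y - x * x))


def finite_difference_pair(grad, theta, v, eps=EPS):
    """Return (d_theta, d_grad): a parameter step and the gradient change it causes."""
    d_theta = tuple(eps * vi for vi in v)
    shifted = tuple(t + d for t, d in zip(theta, d_theta))
    g0 = grad(*theta)
    g1 = grad(*shifted)
    d_grad = tuple(a - b for a, b in zip(g1, g0))
    return d_theta, d_grad


theta = (-1.0, 1.0)   # evaluation point (hypothetical)
v = (1.0, 0.5)        # probe direction (hypothetical)

d_theta, d_grad = finite_difference_pair(rosenbrock_grad, theta, v)

# Dividing the gradient change by the step size estimates H v.
hvp_estimate = tuple(dg / EPS for dg in d_grad)

# Exact Hessian of the Rosenbrock function at (-1, 1) is [[802, 400], [400, 200]].
hvp_exact = (802.0 * v[0] + 400.0 * v[1], 400.0 * v[0] + 200.0 * v[1])

print("finite-difference H v:", hvp_estimate)
print("exact H v:            ", hvp_exact)
```

The pair `(d_theta, d_grad)` plays the same role as the secant pair in BFGS: it carries curvature information without ever forming the Hessian.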

Paper Structure (65 sections, 5 theorems, 82 equations, 19 figures, 18 tables, 8 algorithms)

Key Result

Proposition 3.1

Assume that $H$ is invertible, and $dQ=- \mu \frac{\partial c}{\partial Q}$ or $\mathcal{E} = -\mu Q^T \frac{\partial c}{\partial Q}$. Then, $Q$ converges to $\pm |H|^{-0.5}$ by update equation lr_Q2, $Q^{\rm new}= [(Q')^T Q']^{0.5}$ with $Q' = Q^{\rm old} + Q \mathcal{E}$, and a small enough positive
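
To make the statement concrete, here is a hedged scalar sketch of the multiplicative update $Q' = Q^{\rm old} + Q \mathcal{E}$. The excerpt above does not define the criterion $c$, so the code assumes the PSGD fitting criterion $c = E[h^T P h + v^T P^{-1} v]$ with $P = Q^T Q$, $h = Hv$, and unit-variance $v$; for a scalar Hessian $h$ this reduces to $c(q) = h^2 q^2 + q^{-2}$. The function name, the step size `mu`, the iteration count, and the starting value are hypothetical choices for illustration.

```python
def fit_scalar_q(h, mu=0.01, steps=500, q0=1.0):
    """Fit a scalar Q for a scalar Hessian h using Q <- Q + Q * E."""
    q = q0
    for _ in range(steps):
        # Gradient of the assumed criterion c(q) = h^2 q^2 + q^(-2).
        dc_dq = 2.0 * h * h * q - 2.0 / q ** 3
        # E = -mu * Q^T * dc/dQ; for a scalar this is -mu * q * dc_dq.
        e = -mu * q * dc_dq
        # Multiplicative (Lie-group style) step: Q' = Q + Q * E.
        q = q + q * e
        # Scalar analogue of Q_new = [(Q')^T Q']^0.5 is the absolute value.
        q = abs(q)
    return q


h = -4.0                      # a negative "Hessian", to show that |H| appears
q = fit_scalar_q(h)
print("fitted q:    ", q)
print("|h|^(-0.5):  ", abs(h) ** -0.5)
print("P = q^2:     ", q * q, "vs 1/|h| =", 1.0 / abs(h))
```

In this scalar setting the fixed point satisfies $h^2 q^2 = q^{-2}$, that is $q = |h|^{-0.5}$, which matches the limit named in the proposition. The step is scaled by the current $q$, so no damping term is needed to keep the iterate away from zero.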

Figures (19)

  • Figure 1: (a) Rosenbrock objective minimization comparison among PSGD & its competitors. Only PSGD shows a clear quadratic convergence curve. (b) MNIST handwritten digit recognition with LeNet5. Hessians at the minima of Adam are estimated with a dummy LRA PSGD optimizer that only updates the preconditioner. (c) PSGD is able to significantly outperform closed-form solutions for curvature at bfloat16. This exemplifies the stability of our method.
  • Figure 2: CIFAR-10 ResNet-18: (a) Robustness of PSGD: As the classification task increases in complexity ($\rightarrow$), PSGD consistently outperforms other first- and second-order optimizers. (b) Asymmetric Label Noise Train Acc: Accuracy plots based on incorrect noisy labels. PSGD effectively mitigates label noise, learning the true underlying solution with low variance, while other optimizers tend to overfit/memorize the misleading training set. (c) Asymmetric Label Noise Test Acc: Under ground-truth test labels, PSGD reaches a significantly better test accuracy with a very low variance compared to Apollo, Adam, and SGD.
  • Figure 3: PSGD outperforms SOTA optimizers on PPO RL on Walker2d & HalfCheetah.
  • Figure 4: Comparison of three diagonal preconditioner fitting methods on a random dense $100\times 100$ Hessian with eigenvalues drawn from the standard uniform distribution.
  • Figure 5: PSGD outperforms SOTA optimizers on PPO RL on Walker2d & HalfCheetah.
  • ...and 14 more figures

Theorems & Definitions (17)

  • Proposition 3.1
  • Corollary 3.1.1
  • Claim 3.1
  • Claim 3.2
  • Claim 3.3
  • Proposition 1.1
  • proof
  • Corollary 1.0.1
  • proof
  • Corollary 1.0.1
  • ...and 7 more