Advancing Multi-Secant Quasi-Newton Methods for General Convex Functions

Mokhwa Lee, Yifan Sun

TL;DR

The paper tackles stability challenges in multisecant quasi-Newton methods for general convex functions by introducing a cheap diagonal PSD perturbation combined with symmetrization. It proves a local $q$-superlinear convergence result under standard smoothness and convexity assumptions while decaying the perturbation to zero, and demonstrates the method's practical competitiveness through extensive numerical experiments. The authors also extend the approach to limited-memory settings (L-BFGS) and explore nonconvex neural network training, showing improved convergence in ill-conditioned scenarios but highlighting stability concerns requiring adaptive techniques. Overall, multisecant QN with PSD perturbation offers a meaningful advance over single-secant updates, providing faster convergence in challenging landscapes and a viable path toward scalable, higher-order optimization in machine learning and scientific computing.

Abstract

Quasi-Newton (QN) methods provide an efficient alternative to second-order methods for minimizing smooth unconstrained problems. While QN methods generally compose a Hessian estimate based on one secant interpolation per iteration, multisecant methods use multiple secant interpolations and can improve the quality of the Hessian estimate at small additional overhead cost. However, implementing multisecant QN methods has several key challenges involving method stability, the most critical of which is that when the objective function is convex but not quadratic, the Hessian approximate is not, in general, symmetric positive semidefinite (PSD), and the steps are not guaranteed to be descent directions. We therefore investigate a symmetrized and PSD-perturbed Hessian approximation method for multisecant QN. We offer an efficiently computable method for producing the PSD perturbation, show superlinear convergence of the new method, and demonstrate improved numerical experiments over general convex minimization problems. We also investigate the limited memory extension of the method, focusing on BFGS, on both convex and non-convex functions. Our results suggest that in ill-conditioned optimization landscapes, leveraging multiple secants can accelerate convergence and yield higher-quality solutions compared to traditional single-secant methods.
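The pipeline described in the abstract (multisecant update, then symmetrization, then a diagonal PSD perturbation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `multisecant_bfgs_update`, `symmetrize_and_shift`, and `logistic_grad` are hypothetical, the test objective is an assumed regularized logistic-type function, and the shift $\mu$ is obtained here from a full eigenvalue computation as a simple stand-in for the paper's cheaper perturbation rule.

```python
import numpy as np


def multisecant_bfgs_update(B, S, Y):
    """Block BFGS-style update enforcing the q secant equations B_new @ S == Y.

    B: (n, n) current Hessian estimate; S, Y: (n, q) blocks of steps and
    gradient differences.
    """
    BS = B @ S
    # Remove old curvature along span(S): B S (S^T B S)^{-1} S^T B
    drop = BS @ np.linalg.solve(S.T @ BS, S.T @ B)
    # Add new curvature: Y (Y^T S)^{-1} Y^T. For non-quadratic f, Y^T S is
    # generally not symmetric, so this term can break symmetry of the result.
    add = Y @ np.linalg.solve(Y.T @ S, Y.T)
    return B - drop + add


def symmetrize_and_shift(B, eps=1e-8):
    """Symmetrize B, then add mu * I so the result is positive definite.

    ASSUMPTION: mu comes from a full eigendecomposition; this is only an
    illustrative stand-in, not the paper's efficiently computable rule.
    """
    B_sym = 0.5 * (B + B.T)
    lam_min = np.linalg.eigvalsh(B_sym)[0]  # eigenvalues are returned ascending
    mu = max(0.0, -lam_min) + eps
    return B_sym + mu * np.eye(B.shape[0]), mu


def logistic_grad(A, x):
    """Gradient of sum_i log(1 + exp(a_i^T x)) + 0.5 * ||x||^2 (assumed test problem)."""
    return A.T @ (1.0 / (1.0 + np.exp(-A @ x))) + x


rng = np.random.default_rng(0)
n, q = 6, 3
A = rng.standard_normal((10, n))
x = rng.standard_normal(n)
S = 0.3 * rng.standard_normal((n, q))  # q steps anchored at x
Y = np.column_stack(
    [logistic_grad(A, x + S[:, k]) - logistic_grad(A, x) for k in range(q)]
)

B_ms = multisecant_bfgs_update(np.eye(n), S, Y)  # satisfies all q secant equations
B_fixed, mu = symmetrize_and_shift(B_ms)         # symmetric and positive definite
```

In this sketch `B_ms` satisfies all $q$ secant equations but need not be symmetric, while `B_fixed` is symmetric and positive definite but in general satisfies the secant equations only approximately.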

Paper Structure

This paper contains 44 sections, 19 theorems, 178 equations, 6 figures, 17 tables, 1 algorithm.

Key Result

Theorem 3.1

Consider a non-symmetric matrix $W$, $c>0$, and … Then $\Delta + \mu I$ is PSD if and only if … is PSD, for $G = $ … and $F = VSU^\top$. Here, $W = U\Sigma V^\top$ is the SVD of $W$, and $S_{i,i} = \frac{-1+\sqrt{1+4c^2\Sigma_{i,i}^{-2}}}{2\Sigma_{i,i}^{-1}}$. Also, $\tilde{q} = q$ for Broyden's method, and $\tilde{q} = 2q$ for Powell, DFP, and BFGS.
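
As a quick sanity check on the stated formula for $S_{i,i}$ (an algebraic rearrangement only, not taken from the paper's proof, and assuming $\Sigma_{i,i} > 0$), multiplying numerator and denominator by $\Sigma_{i,i}$ gives

$$S_{i,i} = \frac{-\Sigma_{i,i} + \sqrt{\Sigma_{i,i}^2 + 4c^2}}{2},$$

so $S_{i,i}$ is the positive root of

$$s^2 + \Sigma_{i,i}\, s - c^2 = 0, \qquad \text{equivalently} \qquad S_{i,i}\,(S_{i,i} + \Sigma_{i,i}) = c^2.$$

In particular, $0 < S_{i,i} < c$, and $S_{i,i} \to c$ as $\Sigma_{i,i} \to 0$.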

Figures (6)

  • Figure 2: Comparison of Newton, gradient descent (GD), single-secant QN methods (S), and multi-secant QN methods (M) on logistic regression with $m = 200, n = 100, q = 5$. Top: direct solve. Bottom: Woodbury inverse. Both high (H) signal and low (L) signal regime problems are tested.
  • Figure 3: Comparison of QN method improvements, including symmetrization, PSD projection, and our simple diagonal boost. The problem sizes are $m=200,n=100$ and $q=5$ for multisecant methods. All use the Woodbury inverse update. Top: secants built using curve-hugging. Bottom: secants built using anchored-at-most-recent. The problem uses $\bar{c} = 30$ (H).
  • Figure 4: Ablation of several techniques: PSD correction ($\nu > 0$), scaling, and rejection. The problem sizes are $m=200,n=100$ and $q=5$ for multisecant methods.
  • Figure 5: Performance of L-MS-BFGS on logistic regression. AMS = almost multi-secant (our method). Top: r = rejection. Bottom: no rejection or scaling used. The problem sizes are $m=2000,n=1000$.
  • Figure 6: Runtime of various methods. d = direct update, i = inverse update. 1 = single-secant, v = vanilla multisecant, p = with PSD correction (infeasible in practice), o = with diagonal correction. For L-MS-BFGS, the first number is $L$, the limited memory size. For all MS methods, $q = 5$.
  • ...and 1 more figure

Theorems & Definitions (37)

  • Theorem 3.1
  • proof
  • Theorem 4.1: $q$-superlinear conv.
  • Lemma A.1
  • proof
  • Lemma A.2
  • proof
  • Lemma A.3: Smoothness for vectors
  • proof
  • Lemma A.4: Primal dual contraction
  • ...and 27 more