Efficient Stochastic BFGS methods Inspired by Bayesian Principles

André Carlon, Luis Espath, Raúl Tempone

TL;DR

The paper addresses the challenge of incorporating second-order information into stochastic optimization by deriving stochastic quasi-Newton methods from Bayesian principles. It formulates a probabilistic model over inverse Hessians using curvature pairs, which yields S-BFGS and its memory-efficient variant, L-S-BFGS, with closed-form updates and curvature rules that suppress noise amplification. A convergence analysis shows that preconditioned SGD with these updates achieves provable progress under standard assumptions. Experiments on a high-dimensional quadratic problem and large-scale logistic regression demonstrate robustness and scalability, with L-S-BFGS outperforming established baselines. The work offers a principled, computationally efficient way to exploit curvature information in stochastic settings and suggests integration with variance-reduction techniques as a future direction.

Abstract

Quasi-Newton methods are ubiquitous in deterministic local search due to their efficiency and low computational cost. This class of methods uses the history of gradient evaluations to approximate second-order derivatives. However, only noisy gradient observations are accessible in stochastic optimization; thus, deriving quasi-Newton methods in this setting is challenging. Although most existing quasi-Newton methods for stochastic optimization rely on deterministic equations that are modified to circumvent noise, we propose a new approach inspired by Bayesian inference to assimilate noisy gradient information and derive the stochastic counterparts to standard quasi-Newton methods. We focus on the derivations of stochastic BFGS and L-BFGS, but our methodology can also be employed to derive stochastic analogs of other quasi-Newton methods. The resulting stochastic BFGS (S-BFGS) and stochastic L-BFGS (L-S-BFGS) can effectively learn an inverse Hessian approximation even with small batch sizes. For a problem of dimension $d$, the iteration cost of S-BFGS is $\mathcal{O}(d^2)$, and the cost of L-S-BFGS is $\mathcal{O}(d)$. Numerical experiments with a dimensionality of up to $30,720$ demonstrate the efficiency and robustness of the proposed method.
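To make the per-iteration costs quoted in the abstract concrete, the following is a minimal sketch of the classical, deterministic BFGS inverse-Hessian update and the L-BFGS two-loop recursion. It is not the S-BFGS or L-S-BFGS update derived in the paper, whose closed form is not reproduced here. The function names (`bfgs_inverse_update`, `lbfgs_direction`), the scaling parameter `gamma`, and the toy vectors are assumptions chosen for illustration. The dense update touches every entry of the $d \times d$ matrix once, hence $\mathcal{O}(d^2)$; the two-loop recursion only uses inner products with the stored pairs, hence $\mathcal{O}(d)$ per stored pair.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(H, v):
    """Dense matrix-vector product, H given as a list of rows."""
    return [dot(row, v) for row in H]


def bfgs_inverse_update(H, s, y):
    """Classical BFGS update of a symmetric inverse-Hessian approximation H.

    Uses the curvature pair s = x_{k+1} - x_k, y = g_{k+1} - g_k.
    Cost is O(d^2): one matrix-vector product plus a rank-two correction.
    """
    ys = dot(y, s)
    if ys <= 0.0:
        raise ValueError("curvature condition y^T s > 0 is violated")
    inv_ys = 1.0 / ys
    Hy = matvec(H, y)
    c = inv_ys * inv_ys * dot(y, Hy) + inv_ys
    d = len(s)
    return [
        [H[i][j] - inv_ys * (s[i] * Hy[j] + Hy[i] * s[j]) + c * s[i] * s[j]
         for j in range(d)]
        for i in range(d)
    ]


def lbfgs_direction(g, pairs, gamma=1.0):
    """Two-loop recursion: returns H g without ever forming H.

    `pairs` holds (s, y) tuples ordered oldest first; the initial
    approximation is gamma * I. Cost is O(d) per stored pair.
    """
    q = list(g)
    alphas = []
    for s, y in reversed(pairs):  # newest pair first
        a = dot(s, q) / dot(y, s)
        alphas.append(a)
        q = [qi - a * yi for qi, yi in zip(q, y)]
    r = [gamma * qi for qi in q]
    for (s, y), a in zip(pairs, reversed(alphas)):  # oldest pair first
        b = dot(y, r) / dot(y, s)
        r = [ri + (a - b) * si for ri, si in zip(r, s)]
    return r


# Toy curvature pair and gradient (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
s_demo, y_demo = [1.0, 0.5], [2.0, 0.25]
g_demo = [0.7, -1.3]

H_demo = bfgs_inverse_update(identity, s_demo, y_demo)
print("H y:", matvec(H_demo, y_demo), "should match s:", s_demo)
print("dense H g:", matvec(H_demo, g_demo))
print("two-loop H g:", lbfgs_direction(g_demo, [(s_demo, y_demo)]))
```

With exact gradients the updated matrix satisfies the secant equation $\boldsymbol{H}_{k+1} \boldsymbol{y}_k = \boldsymbol{s}_k$ exactly. With noisy gradients, enforcing it exactly lets the noise in $\boldsymbol{y}_k$ enter the approximation directly, which is the difficulty the Bayesian formulation in the paper is meant to address.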

Paper Structure

This paper contains 15 sections, 6 theorems, 46 equations, 4 figures, 1 table, 2 algorithms.

Key Result

Proposition 1

Let $\boldsymbol{W} \in \mathbb{R}^{d \times d}$ be any positive-definite matrix such that $\boldsymbol{W} \boldsymbol{s}_k = \boldsymbol{y}_k$. Then, the update of the stochastic BFGS (S-BFGS), defined as the solution of the minimization problem eq:opt_subproblem with $\boldsymbol{W}_{pr}=\boldsymb

Figures (4)

  • Figure 1: Graphical representation of the Bayesian formulation for quasi-Newton methods. Left: Contour of the negative log prior distribution of $\boldsymbol{H}$, with a blue line representing the affine subspace of matrices satisfying the secant equation and a red $\times$ marking the BFGS update. Center: Contour of the negative log-likelihood. Greater confidence in the observed $\boldsymbol{y}_k$ makes it more likely that the true Hessian is near the affine subspace $\boldsymbol{H} \boldsymbol{y}_k = \boldsymbol{s}_k$. Right: Contour of the negative log posterior for a given $\rho$ and the $\boldsymbol{H}_{k+1}$ that minimizes it. A larger $\rho$ keeps $\boldsymbol{H}_{k+1}$ closer to $\boldsymbol{H}_k$, whereas as $\rho \downarrow 0$, the new inverse Hessian approximation $\boldsymbol{H}_{k+1}$ converges to the BFGS update.
  • Figure 2: Quadratic problem with a condition number of $10^6$: optimality gap vs. iterations for SGD and S-BFGS (top-left), BFGS and S-BFGS (top-right), eigenvalue profiles (bottom-left), and a measure of distance $\Psi$ between $\boldsymbol{H}_k \nabla^2 F(\boldsymbol{x}_k)$ and $\boldsymbol{I}$ (bottom-right). The SGD method becomes stuck and does not progress, possibly due to the large condition number of the problem. The S-BFGS method converges much faster in the initial iterations until it reaches the asymptotic regime, as predicted by the convergence theorem. Vanilla BFGS approximates the Hessian eigenvalues better but fails to converge due to noise amplification.
  • Figure 3: Logistic regression: optimality gap vs. epoch for 50 independent runs of L-S-BFGS, SdLBFGS, and oLBFGS. Light shading indicates the $90\%$ confidence interval, darker shading indicates the $50\%$ confidence interval, and solid lines indicate median values. The robustness of L-S-BFGS to noise allows larger step sizes, giving it an advantage over the baseline methods. Because SdLBFGS also has mechanisms to control noise, it performs better than oLBFGS, which requires step sizes as small as $3 \times 10^{-3}$ to converge in the MNIST experiment.
  • Figure 4: Optimality gap vs. epoch for the different methods using different step sizes.

Theorems & Definitions (12)

  • Proposition 1
  • Proof 1
  • Lemma 1: Properties of the S-BFGS update
  • Proof 2
  • Lemma 2
  • Proof 3
  • Corollary 1: Bounds on the eigenvalues of preconditioning matrices
  • Proof 4
  • Lemma 3: preconditioned SGD
  • Proof 5
  • ...and 2 more