Stochastic Hessian Fittings with Lie Groups

Xi-Lin Li

TL;DR

The paper develops a unified framework for stochastic Hessian fitting using the PSGD preconditioner criterion, connecting classical second-order methods (BFGS, Gauss-Newton, natural gradient) with modern inverse-free, Lie-group based preconditioner updates. It proves convexity properties in SPD and Lie-group geometries, with strong convexity on a polar-quotient of GL(n,R), enabling linear convergence of SGD-type updates. It introduces multiple inverse-free and sparse Lie-group preconditioners (diagonal, Kronecker, low-rank) and presents both theoretical and empirical results showing robust performance in noisy and time-varying settings, plus practical algorithms for large-scale problems. The work also establishes practical links to Newton-Schulz iterations and demonstrates the approach on tensor decomposition and transformer/GPT-scale tasks, highlighting improved stability and convergence without expensive inverses or decompositions. Overall, the framework offers scalable, robust second-order optimization tools for stochastic problems across Euclidean, SPD, and Lie-group geometries, with concrete methods and empirical validation.

Abstract

This report investigates the fitting of the Hessian or its inverse for stochastic optimizations using a Hessian fitting criterion derived from the preconditioned stochastic gradient descent (PSGD) method. This criterion is closely related to many widely used second-order and adaptive gradient optimization methods, including BFGS, the Gauss-Newton algorithm, natural gradient descent, and AdaGrad. Our analyses reveal the efficiency and reliability differences of a broad range of preconditioner fitting methods, ranging from closed-form to iterative approaches, using Hessian-vector products or stochastic gradients only, with Hessian fittings across various geometric settings (the Euclidean space, the manifold of symmetric positive definite (SPD) matrices, and a variety of Lie groups). The most intriguing finding is that the Hessian fitting problem is strongly convex under mild conditions in certain general Lie groups. This result turns Hessian fitting into a well-behaved Lie group optimization problem and facilitates the design of highly efficient and elegant Lie group sparse preconditioner fitting methods for large-scale stochastic optimizations.
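The abstract refers to the PSGD fitting criterion without writing it out. As a rough illustration only, the sketch below assumes the form commonly given in the PSGD literature, $c(P)=E[h^TPh+v^TP^{-1}v]$ with $v\sim\mathcal{N}(0,I)$ and $h=Hv$; this form is an assumption here and should be checked against the paper. The example restricts $P$ to a positive diagonal matrix, where the sample minimizer has the closed form $p_i=\sqrt{\sum v_i^2/\sum h_i^2}$. The function names and the diagonal test Hessian `hess_diag` are hypothetical.

```python
import random


def sample_pairs(hess_diag, num_samples, seed=0):
    """Draw (v, h) pairs with v ~ N(0, I) and h = H v for a diagonal H."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_samples):
        v = [rng.gauss(0.0, 1.0) for _ in hess_diag]
        h = [d * x for d, x in zip(hess_diag, v)]  # Hessian-vector product
        pairs.append((v, h))
    return pairs


def criterion(p, pairs):
    """Sample average of h^T P h + v^T P^{-1} v for P = diag(p) (assumed form)."""
    total = 0.0
    for v, h in pairs:
        total += sum(pi * hi * hi + vi * vi / pi for pi, vi, hi in zip(p, v, h))
    return total / len(pairs)


def closed_form_diagonal(pairs):
    """Minimizer of the sample criterion over positive diagonal P."""
    n = len(pairs[0][0])
    vv = [sum(v[i] ** 2 for v, _ in pairs) for i in range(n)]
    hh = [sum(h[i] ** 2 for _, h in pairs) for i in range(n)]
    return [(a / b) ** 0.5 for a, b in zip(vv, hh)]


# Hypothetical diagonal Hessian with one negative entry.
hess_diag = [4.0, 0.5, -2.0]
pairs = sample_pairs(hess_diag, num_samples=1000)
p_star = closed_form_diagonal(pairs)
print(p_star)
print(criterion(p_star, pairs))
```

In this diagonal setting the fitted entries equal $1/|d_i|$ up to rounding, so the criterion fits the inverse of $|H|$ rather than of $H$ itself, which is why a negative curvature entry still yields a positive preconditioner. The iterative and Lie-group updates discussed in the paper are not reproduced here.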

Paper Structure (44 sections, 110 equations, 7 figures, 3 tables)

This paper contains 44 sections, 110 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: (a) Typical convergence curves for the five Hessian fitting methods in Table II when $H$ is a $3\times 3$ Hilbert matrix, i.e., $H_{i,j}=1/(i+j-1)$ with $1\le i,j\le n$ and $n=3$. (b) and (c): Zoomed-in views of the convergence curves of SGD in the group GL$(n, \mathbb{R})$ and of Newton's method, respectively, for better visualization.
  • Figure 2: (a) Typical convergence curves when $H$ is a $50\times 50$ matrix with $H_{i,i}=1$ and $H_{i,j}=0.5$ for $|i-j|=1$. Its eigenvalues lie in the range $(0, 2)$. (b) The same $H$ as in (a), but with the noisy model of (\ref{hv_model}), where $\epsilon \sim \mathcal{N}(0, \sigma_{\epsilon}^2I )$ and $\sigma_{\epsilon}=0.01$. (c) Time-varying Hessians defined by the process $H_{t+1}=H_t + uu^T$, where $u_i\sim \mathcal{U}(0,1)$ for $1\le i\le n=50$, and $H_0$ is a matrix with all elements equal to $1/4$.
  • Figure 3: Plots illustrating the possible shapes of $f(a)={\rm tr}(e^{aR}Q - Q)$ with different orders of approximation of $e^{aR}$, where $R=Q^T - Q$ is the rotation group generator. Note that ${\rm tr}(RQ)\ge 0$ and ${\rm tr}(R^3Q)\le 0$, with equality only when $R=0$.
  • Figure 4: The five Hessian fitting methods in (\ref{standard_method}), (\ref{update_QEQ}), (\ref{update_quad1}), (\ref{update_quad2}) and (\ref{update_QEP}) are compared on whitening the gradient covariance matrix $H={\rm hilb}(64) + 10^{-6}I$. We start from $Q_0=I$, set $\beta=1$, and use $\mu=1$ for (\ref{standard_method}) and (\ref{update_QEP}), and $\mu=0.1$ for (\ref{update_QEQ}), (\ref{update_quad1}), and (\ref{update_quad2}).
  • Figure 5: Comparisons on the tensor rank decomposition problem of (\ref{trd}), where $R=10$, $I=20$, $J=50$, $K=100$, $\tau_{i,j,k}= \sum_{r=1}^{R} a_{ri} b_{rj}c_{rk}$, and all the elements of $a$, $b$ and $c$ are drawn from $\mathcal{N}(0, 1)$. Except for the PSGD LRA optimizer, the inverse-free PSGD with local coordinate $dQ=Q^{0.5}\mathcal{E}Q^{1.5}$ is used.
  • ...and 2 more figures
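
The test Hessians described in the captions of Figures 1 and 2 are simple to construct. The following is a minimal sketch in plain Python, not the author's experimental code; all function names are hypothetical. It also checks the stated eigenvalue range $(0, 2)$ for the tridiagonal matrix using the standard closed form for symmetric Toeplitz tridiagonal matrices, $\lambda_k = 1 + 2\cdot 0.5\cos(k\pi/(n+1))$.

```python
import math
import random


def hilbert(n):
    """Hilbert matrix with H[i][j] = 1 / (i + j - 1) for 1-based i, j (Figure 1)."""
    return [[1.0 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)]


def tridiagonal(n, off=0.5):
    """Unit diagonal with `off` on the first off-diagonals (Figure 2a)."""
    return [[1.0 if i == j else (off if abs(i - j) == 1 else 0.0)
             for j in range(n)] for i in range(n)]


def tridiagonal_eigenvalues(n, off=0.5):
    """Closed-form eigenvalues 1 + 2*off*cos(k*pi/(n+1)), k = 1..n."""
    return [1.0 + 2.0 * off * math.cos(k * math.pi / (n + 1)) for k in range(1, n + 1)]


def rank_one_step(H, rng):
    """One step of H_{t+1} = H_t + u u^T with u_i ~ U(0, 1) (Figure 2c)."""
    n = len(H)
    u = [rng.random() for _ in range(n)]
    return [[H[i][j] + u[i] * u[j] for j in range(n)] for i in range(n)]


H1 = hilbert(3)
H2 = tridiagonal(50)
eigs = tridiagonal_eigenvalues(50)
print(min(eigs), max(eigs))  # both strictly inside (0, 2)

# Time-varying Hessian: H_0 has all entries 1/4, then one rank-one update.
rng = random.Random(0)
H0 = [[0.25] * 50 for _ in range(50)]
H_next = rank_one_step(H0, rng)
```

The noisy model of Figure 2(b) is not sketched because its defining equation is not reproduced in this summary.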