Simplifying Momentum-based Positive-definite Submanifold Optimization with Applications to Deep Learning

Wu Lin, Valentin Duruisseaux, Melvin Leok, Frank Nielsen, Mohammad Emtiyaz Khan, Mark Schmidt

TL;DR

This work proposes a generalized version of the Riemannian normal coordinates that dynamically orthonormalizes the metric and locally converts the problem into an unconstrained problem in the Euclidean space.

Abstract

Riemannian submanifold optimization with momentum is computationally challenging because, to ensure that the iterates remain on the submanifold, we often need to solve difficult differential equations. Here, we simplify such difficulties for a class of sparse or structured symmetric positive-definite matrices with the affine-invariant metric. We do so by proposing a generalized version of the Riemannian normal coordinates that dynamically orthonormalizes the metric and locally converts the problem into an unconstrained problem in the Euclidean space. We use our approach to simplify existing approaches for structured covariances and develop matrix-inverse-free $2^\text{nd}$-order optimizers for deep learning with low precision by using only matrix multiplications. Code: https://github.com/yorkerlin/StructuredNGD-DL
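
To make the "orthonormalizes the metric" claim concrete, the following is a minimal numerical sketch, not the authors' implementation. It assumes one particular local coordinate on the full SPD manifold, $\boldsymbol{\Sigma}(\mathbf{M}) = \mathbf{A}\,\mathrm{Expm}(\mathbf{M})\,\mathbf{A}^T$ with symmetric $\mathbf{M}$, and takes the affine-invariant metric to be $\mathrm{Tr}(\boldsymbol{\Sigma}^{-1}\mathbf{U}\boldsymbol{\Sigma}^{-1}\mathbf{V})$ up to a constant factor. The names `sigma_of`, `affine_invariant_inner`, and `expm_sym` are hypothetical helpers introduced only for this illustration. Under these assumptions, the metric evaluated at $\mathbf{M} = \mathbf{0}$ should reduce to the Euclidean inner product $\mathrm{Tr}(\mathbf{N}_1 \mathbf{N}_2)$ on the local coordinates.

```python
import numpy as np

def sym(X):
    """Symmetrize a square matrix."""
    return 0.5 * (X + X.T)

def expm_sym(M):
    """Matrix exponential of a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.exp(w)) @ V.T

def sigma_of(A, M):
    """Assumed local coordinate: Sigma(M) = A Expm(M) A^T, so Sigma(0) = A A^T."""
    return A @ expm_sym(M) @ A.T

def affine_invariant_inner(Sigma, U, V):
    """Affine-invariant metric Tr(Sigma^{-1} U Sigma^{-1} V), up to a constant."""
    return np.trace(np.linalg.solve(Sigma, U) @ np.linalg.solve(Sigma, V))

rng = np.random.default_rng(0)
d = 4

# Invertible factor A: lower triangular with diagonal entries >= 1.
A = np.tril(rng.standard_normal((d, d)))
np.fill_diagonal(A, 1.0 + np.abs(np.diag(A)))

# Two symmetric directions in the local (Euclidean) coordinate M.
N1 = sym(rng.standard_normal((d, d)))
N2 = sym(rng.standard_normal((d, d)))

Sigma0 = sigma_of(A, np.zeros((d, d)))

# Tangent vectors on the manifold: d/dt Sigma(t N)|_{t=0} = A N A^T.
U, V = A @ N1 @ A.T, A @ N2 @ A.T

metric_value = affine_invariant_inner(Sigma0, U, V)
euclidean_value = np.trace(N1 @ N2)

# Any step taken in M keeps the iterate symmetric positive-definite.
Sigma_step = sigma_of(A, -0.1 * N1)
min_eig = np.linalg.eigvalsh(Sigma_step).min()

print(metric_value, euclidean_value, min_eig)
```

In this sketch the two inner products agree to floating-point precision, and the smallest eigenvalue after the step stays positive. That is the sense in which a step in $\mathbf{M}$ can be treated as an unconstrained Euclidean step while the iterate remains on the manifold.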

Paper Structure (44 sections, 109 equations, 6 figures, 8 tables)

This paper contains 44 sections, 109 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: An (orthonormal) SNC/GNC is generated at each iteration.
  • Figure 2: In our update, we denote $\mathbf{H}_K := \mathbf{K}^T \boldsymbol{\mu}_{AA} \mathbf{K}$, $\mathbf{H}_C := \mathbf{C}^T \boldsymbol{\mu}_{GG} \mathbf{C}$, $\kappa^2 := \lambda \mathrm{Tr}(\mathbf{K}^T \mathbf{K})$, and $c^2 := \lambda \mathrm{Tr}(\mathbf{C}^T \mathbf{C})$, where $\mathrm{vec}^{-1}(\boldsymbol{\mu}) \in \mathbb{R}^{d \times p}$, $\mathbf{C} \in \mathbb{R}^{d \times d}$, and $\mathbf{K} \in \mathbb{R}^{p \times p}$. Note that we merge the factors $\frac{1}{2\sqrt{d}}$ and $\frac{1}{2\sqrt{p}}$ in Eq. \ref{eq:mat_gauss_norm_coord} into the updates in $\mathbf{m}_K$ and $\mathbf{m}_C$, respectively (see Eq. \ref{eq:matgauss_just} in Appx. \ref{apd:mat_gauss_dl} for a justification). We use the linear truncation of the matrix exponential function. Our update does not require explicit matrix inverses. We can also pre-compute $\mathbf{C}\mathbf{C}^T$ and $\mathbf{K}\mathbf{K}^T$ when $\mathbf{T} > 1$. In KFAC, a damping term $\lambda \mathbf{I}$ is introduced to handle the singularity of $(\mathbf{K}\mathbf{K}^T)^{-1}$ and $(\mathbf{C}\mathbf{C}^T)^{-1}$. We introduce a similar damping term in $\kappa^2$ and $c^2$ (see Appx. \ref{apd:mat_gauss_dl} for a derivation) to improve numerical stability. Our update and KFAC include momentum weight $\alpha_2$ for layer-wise NN weights $\boldsymbol{\mu}$ and (L2) weight decay $\gamma$. In our update, we also introduce momentum weight $\alpha_1$ in the SPD preconditioner. Our update is more numerically robust than KFAC. Thus, our update can often use a larger stepsize $\beta_2$ and a smaller damping weight $\lambda$ than KFAC.
  • Figure 3: The performance of our updates for optimization problems. Fig. \ref{fig:a}-\ref{fig:b} show the performance on SPD manifold optimization problems. Our update, which uses approximations of the Riemannian maps, achieves performance similar to that of existing Riemannian methods using the exact Riemannian maps. Fig. \ref{fig:c} shows the performance on an MLE problem on a Gaussian mixture. The method denoted by "sub" performs updates on an SPD submanifold (see Sec. \ref{sec:sngd_special}), while the other methods perform updates on an SPD manifold. Note that the loss in Fig. \ref{fig:c} is computed using the augmented $(d+1)$-dim Gaussian components suggested by \cite{hosseini2015matrix}. If we perform updates on the SPD manifold $\mathcal{S}_{++}^{k \times k}$ with $k = d+1$ instead of the submanifold, we cannot obtain the original (non-augmented) $d$-dim Gaussian components during the iterations, since the updates are not guaranteed to stay on the submanifold. Thus, we cannot use the standard MLE loss defined by the $d$-dim Gaussians. In Fig. \ref{fig:a}-\ref{fig:c}, we use the same stepsize and momentum weight for all methods. Note that our method and \cite{lin2021tractable} can use a larger stepsize than the other methods using the exact Riemannian maps. Our method and \cite{lin2021tractable} use the quadratic truncation, while the other methods use the exact maps. We observe that our method with truncation is more numerically robust than the other methods using the exact maps. Fig. \ref{fig:d} shows the performance using a structured preconditioner to optimize a $1000$-dim function, where our update and structured NGD use Hessian information without computing the full Hessian.
  • Figure 4: The error curves for optimization in deep NN models on the "ImageNet-100" dataset. Our updates achieve lower test error rates than the other baseline methods for NN optimization. We report the number of learnable NN weights, $k$, in parentheses in the title of each plot. For each NN optimization problem, our approach uses a structured and sparse $k$-by-$k$ SPD preconditioning matrix induced by an SPD submanifold. As shown in Table \ref{tab:spd_table}, many Riemannian momentum methods are computationally infeasible here, since they are designed for (dense) SPD preconditioning matrices and have $O(k^3)$ time complexity.
  • Figure 5: Performance of NN optimizers on more datasets. SGD performs best in the classical model and fairly well in the modern models. Our updates achieve competitive test error rates compared to baselines and perform better than KFAC in many cases.
  • ...and 1 more figure
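
The quantities defined in the Figure 2 caption can be sketched numerically as follows. This is a hedged illustration under stated assumptions, not the paper's optimizer. The caption defines only $\mathbf{H}_K$, $\mathbf{H}_C$, $\kappa^2$, and $c^2$, and states that the matrix exponential is linearly truncated and that no explicit inverse is needed. Everything else here is assumed for illustration: the random stand-ins for $\boldsymbol{\mu}_{AA}$ and $\boldsymbol{\mu}_{GG}$, the small step `m_K`, the multiplicative form `K @ expm_linear(-m_K)`, the preconditioned gradient $\mathbf{C}\mathbf{C}^T \mathbf{G} \mathbf{K}\mathbf{K}^T$, and the helper names `curvature_terms` and `expm_linear`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 3, 5      # vec^{-1}(mu) has shape d x p
lam = 1e-3       # damping weight lambda

# Hypothetical stand-ins for the curvature factors mu_AA (p x p) and mu_GG (d x d).
a = rng.standard_normal((8, p))
g = rng.standard_normal((8, d))
mu_AA = a.T @ a / 8
mu_GG = g.T @ g / 8

K = np.eye(p)    # factor with K in R^{p x p}
C = np.eye(d)    # factor with C in R^{d x d}

def curvature_terms(K, C, mu_AA, mu_GG, lam):
    """Quantities named in the caption: H_K, H_C, kappa^2, c^2."""
    H_K = K.T @ mu_AA @ K
    H_C = C.T @ mu_GG @ C
    kappa2 = lam * np.trace(K.T @ K)
    c2 = lam * np.trace(C.T @ C)
    return H_K, H_C, kappa2, c2

def expm_linear(M):
    """Linear truncation of the matrix exponential: Expm(M) ~ I + M."""
    return np.eye(M.shape[0]) + M

H_K, H_C, kappa2, c2 = curvature_terms(K, C, mu_AA, mu_GG, lam)

# Assumed small symmetric step, used only to illustrate the truncation.
m_K = 0.01 * H_K
K_new = K @ expm_linear(-m_K)          # matrix multiplications only

# Reference value with the exact exponential (symmetric argument, so eigh works).
w, V = np.linalg.eigh(-m_K)
K_exact = K @ ((V * np.exp(w)) @ V.T)
trunc_err = np.linalg.norm(K_new - K_exact)

# Assumed inverse-free preconditioning of a d x p gradient G with
# the pre-computed products C C^T and K K^T.
G = rng.standard_normal((d, p))
CCt, KKt = C @ C.T, K_new @ K_new.T
precond_G = CCt @ G @ KKt

print(kappa2, c2, trunc_err, precond_G.shape)
```

With identity factors, $\kappa^2 = \lambda p$ and $c^2 = \lambda d$. The truncation error is second order in the step, which is why a small step keeps the truncated update close to the exact one while using only matrix products.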