Structured Inverse-Free Natural Gradient: Memory-Efficient & Numerically-Stable KFAC

Wu Lin, Felix Dangel, Runa Eschenhagen, Kirill Neklyudov, Agustinus Kristiadi, Richard E. Turner, Alireza Makhzani

TL;DR

This work tackles the memory and numerical stability drawbacks of second-order methods in deep learning, notably KFAC, by introducing Structured Inverse-Free NGD (SINGD). SINGD unifies inverse-free updates (INGD) with structured Kronecker factors to achieve memory efficiency and robustness, often surpassing AdamW in mixed-precision settings. It demonstrates that IKFAC aligns with KFAC in the inverse-free regime and shows that a range of structured factors (diagonal, block-diagonal, hierarchical, Toeplitz) can substantially reduce memory while preserving performance. Empirical results across CNNs, transformers, and GNNs, including large-scale ViT on ImageNet-100, indicate that SINGD delivers competitive or superior test accuracy with lower memory and comparable or lower iteration cost, thereby broadening the applicability of second-order methods in low-precision training.

Abstract

Second-order methods such as KFAC can be useful for neural net training. However, they are often memory-inefficient since their preconditioning Kronecker factors are dense, and numerically unstable in low precision as they require matrix inversion or decomposition. These limitations render such methods unpopular for modern mixed-precision training. We address them by (i) formulating an inverse-free KFAC update and (ii) imposing structures in the Kronecker factors, resulting in structured inverse-free natural gradient descent (SINGD). On modern neural networks, we show that SINGD is memory-efficient and numerically robust, in contrast to KFAC, and often outperforms AdamW even in half precision. Our work closes a gap between first- and second-order methods in modern low-precision training.
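
To make the memory argument concrete, the following minimal sketch counts the entries stored for the two Kronecker factors of a single linear layer under three of the structures named above. The layer size, the block size, and the helper name `factor_entries` are hypothetical choices for illustration; the counts ignore symmetry and any additional optimizer state, so they indicate scaling rather than the paper's measured memory.

```python
def factor_entries(d, structure, block=32):
    """Entries stored for one d x d Kronecker factor (illustrative count only)."""
    if structure == "dense":
        return d * d
    if structure == "diagonal":
        return d
    if structure == "block-diagonal":
        # `block` is a hypothetical block size; a smaller trailing block absorbs the remainder.
        full, rest = divmod(d, block)
        return full * block * block + rest * rest
    raise ValueError(f"unknown structure: {structure}")


# Hypothetical layer with d_in inputs and d_out outputs.
d_in, d_out = 4096, 1024
for structure in ("dense", "block-diagonal", "diagonal"):
    total = factor_entries(d_in, structure) + factor_entries(d_out, structure)
    print(f"{structure:>14}: {total} entries")
```

Dense factors grow quadratically in the layer width, while the diagonal and block-diagonal structures grow linearly, which is the effect the structured variants rely on.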

Paper Structure

This paper contains 22 sections, 5 theorems, 46 equations, 9 figures, 6 tables.

Key Result

Theorem 1

If $\mathbf{K}$ is updated according to the IKFAC scheme (Figure 3) with the truncation of the matrix exponential, and these two updates (IKFAC and KFAC) use the same initialization and the same sequence of curvature matrices $\mathbf{U}$, then the product ...

Figures (9)

  • Figure 1: CIFAR-100 experiments on VGG net. Left/Center: Our methods (IKFAC and SINGD) outperform AdamW and, unlike KFAC, perform stably in FP-32 and BFP-16 because they do not require matrix inversions. IKFAC effectively performs KFAC updates and achieves similar performance in FP-32. For this task, replacing the dense Kronecker factors (INGD = SINGD-Dense) with diagonal ones (SINGD-Diag) does not harm performance while reducing cost. Right: Memory consumption. Removing Riemannian momentum (IKFAC) or using structured Kronecker factors (SINGD-Diag) reduces INGD's memory in FP-32 and BFP-16. In BFP-16, SINGD-Diag achieves AdamW's memory consumption (dashed line).
  • Figure 2: Existing methods and their relation to our proposed methods. IKFAC behaves like KFAC (Theorem 1), but is numerically stable in low precision. In contrast to IKFAC, INGD has Riemannian momenta and adaptive damping and curvature, which can yield better performance in practice (see the experiments section). INGD is equivalent to SINGD with unstructured Kronecker factors (SINGD-Dense). Structured Kronecker factors reduce memory and computational cost.
  • Figure 2: Subspaces of the logarithm space and their projection maps $\hat{\Pi}(\mathbf{M})$, where $\mathbf{M}$ is a symmetric matrix. The hierarchical structure is constructed by replacing the diagonal matrix $\mathbf{D}_{22}$ in the rank-$k$ upper-triangular structure with another rank-$k$ triangular matrix $\begin{bmatrix} \mathbf{A}_{22} & \mathbf{0} \\ \mathbf{A}_{23} & \mathbf{A}_{33} \end{bmatrix}$ for a better approximation.
  • Figure 3: Comparison between the KFAC and IKFAC updates for one weight matrix $\mathrm{vec}^{-1}(\boldsymbol{\mu}) \in \mathbb{R}^{d_o \times d_i}$. The flattened gradient is $\mathbf{g} \coloneq \nabla_\mu \ell(\boldsymbol{\mu}) \in \mathbb{R}^{d_o d_i}$ and $\mathrm{vec}^{-1}(\mathbf{g}) \in \mathbb{R}^{d_o \times d_i}$ is its matrix reshape. IKFAC uses $\mathbf{H}_K \coloneq \mathbf{K}^\top \mathbf{U} \mathbf{K}$ and $\mathbf{H}_C \coloneq \mathbf{C}^\top \mathbf{G} \mathbf{C}$ to incorporate the Kronecker curvature $\mathbf{U}$ and $\mathbf{G}$. Both methods use momentum buffers $\mathbf{m}_\mu$ for the weight-decayed update direction with momentum $\alpha_2$ and weight decay $\gamma$, and a learning rate $\beta_2$ for the parameter update. (Left) KFAC uses an exponential moving average with decay $1 - \beta_1$ to accumulate the Kronecker factors and applies a damping term $\lambda \mathbf{I}$ before inversion to handle potential singularities in $\mathbf{S}_K$, $\mathbf{S}_C$. (Right) In contrast to KFAC, IKFAC directly approximates $(\mathbf{S}_K + \lambda \mathbf{I})^{-1}$ and $(\mathbf{S}_C + \lambda \mathbf{I})^{-1}$ by $\mathbf{K}\mathbf{K}^\top$ and $\mathbf{C}\mathbf{C}^\top$. The preconditioner update is a modification of INGD (Lin et al., 2023); the changes (zero Riemannian momentum, and non-adaptive damping and curvature) are highlighted in red.
  • Figure 5: Illustration of structured matrices (Kronecker factors) supported by SINGD, their self-outer product (approximate inverse Hessian factor), and its inverse (approximate Hessian factor). With rank-one triangular matrices $\mathbf{K}$, we can easily impose a low-rank structure on $\mathbf{K}\mathbf{K}^\top$ or $(\mathbf{K}\mathbf{K}^\top)^{-1}$; the latter is difficult to achieve with other approaches.
  • ...and 4 more figures
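
The relationship described in the Figure 3 caption, where $\mathbf{K}\mathbf{K}^\top$ tracks $(\mathbf{S}_K + \lambda \mathbf{I})^{-1}$ without any inversion, can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the inverse-free step $\mathbf{K} \leftarrow \mathbf{K}\,(\mathbf{I} - \tfrac{\beta_1}{2}(\mathbf{H}_K + \lambda \mathbf{K}^\top \mathbf{K} - \mathbf{I}))$ is an assumed form chosen to agree with the KFAC moving average to first order in $\beta_1$ (the summary does not spell the formula out), the curvature matrices are synthetic, and the function names are hypothetical. The explicit inverse appears only as a reference for comparison.

```python
import numpy as np


def kfac_step(S, U, beta1):
    """KFAC-style exponential moving average of the curvature factor."""
    return (1.0 - beta1) * S + beta1 * U


def ikfac_step(K, U, beta1, lam):
    """Assumed inverse-free step: only matrix products, no inversion."""
    eye = np.eye(K.shape[0])
    H = K.T @ U @ K                       # H_K = K^T U K, as in the caption
    M = H + lam * (K.T @ K) - eye
    return K @ (eye - 0.5 * beta1 * M)


def max_gap(beta1, steps=20, d=4, lam=0.1, seed=0):
    """Largest entrywise gap between K K^T and inv(S + lam I) over a run."""
    rng = np.random.default_rng(seed)
    eye = np.eye(d)
    S = eye.copy()                        # shared initialization
    K = eye / np.sqrt(1.0 + lam)          # so that K K^T = inv(S + lam I) at the start
    gap = 0.0
    for _ in range(steps):
        A = rng.standard_normal((d, 8))
        U = A @ A.T / 8.0                 # synthetic positive semi-definite curvature
        S = kfac_step(S, U, beta1)
        K = ikfac_step(K, U, beta1, lam)
        reference = np.linalg.inv(S + lam * eye)   # used for checking only
        gap = max(gap, float(np.abs(K @ K.T - reference).max()))
    return gap


gap_small = max_gap(beta1=0.01)
gap_large = max_gap(beta1=0.1)
print(f"max gap with beta1=0.01: {gap_small:.2e}")
print(f"max gap with beta1=0.10: {gap_large:.2e}")
```

Because both runs see the same sequence of curvature matrices, the gap should shrink markedly as $\beta_1$ decreases, which is the behavior a first-order agreement between the two updates would predict.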

Theorems & Definitions (5)

  • Theorem 1
  • Lemma 4.1
  • Lemma 4.2
  • Lemma 4.3
  • Theorem 2