MAC: An Efficient Gradient Preconditioning using Mean Activation Approximated Curvature

Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko

TL;DR

MAC addresses the high cost of second-order optimizers by exploiting the spectral structure of Kronecker-factored Fisher information matrices (FIMs) and introducing a rank-1 approximation of the activation Kronecker factor (KF). By approximating the activation factor with the outer product of the mean activation and treating the pre-activation gradient factor as the identity, MAC obtains a closed-form preconditioner that can be inverted with the Sherman–Morrison formula and that applies to both CNNs and transformers, including attention layers. The authors prove convergence under overparameterization and show experimentally that MAC trains faster than KFAC variants on CIFAR and ImageNet with comparable or better accuracy, while substantially reducing memory overhead. The work also integrates attention scores into the transformer preconditioner, which improves accuracy on Vision Transformers and preserves stability in large-scale training.
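The rank-1 structure is what makes the inversion cheap. The following is a minimal sketch, not the authors' implementation: it assumes the damped activation factor has the form $\lambda \mathbf{I} + \bar{\mathbf{a}}\bar{\mathbf{a}}^\intercal$ (the damping term $\lambda$ and the function names are assumptions made for illustration) and that the pre-activation gradient factor is the identity, so preconditioning reduces to right-multiplying the weight gradient by the inverse of that factor. Sherman–Morrison gives the inverse in closed form as $\frac{1}{\lambda}\left(\mathbf{I} - \frac{\bar{\mathbf{a}}\bar{\mathbf{a}}^\intercal}{\lambda + \bar{\mathbf{a}}^\intercal\bar{\mathbf{a}}}\right)$, so no matrix is ever formed or factorized.

```python
def precondition_rank1(grad, mean_act, damping):
    """Right-multiply grad by inv(damping * I + outer(mean_act, mean_act)).

    Uses the Sherman-Morrison identity, so the cost is linear in the size
    of grad. The damped rank-1 form is an assumption for illustration.

    grad:     list of rows, shape (out_dim, in_dim)
    mean_act: list of floats, length in_dim (mean activation vector)
    damping:  positive float (hypothetical damping parameter lambda)
    """
    sq_norm = sum(a * a for a in mean_act)
    scale = 1.0 / (damping + sq_norm)
    result = []
    for row in grad:
        proj = sum(g * a for g, a in zip(row, mean_act))  # row . mean_act
        result.append(
            [(g - scale * proj * a) / damping for g, a in zip(row, mean_act)]
        )
    return result


def apply_factor(mat, mean_act, damping):
    """Right-multiply mat by (damping * I + outer(mean_act, mean_act)).

    Used only to check the inverse: applying this after
    precondition_rank1 should return the original gradient.
    """
    result = []
    for row in mat:
        proj = sum(x * a for x, a in zip(row, mean_act))
        result.append([damping * x + proj * a for x, a in zip(row, mean_act)])
    return result


grad = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
mean_act = [0.2, -0.4, 1.0]
damping = 0.1

preconditioned = precondition_rank1(grad, mean_act, damping)
round_trip = apply_factor(preconditioned, mean_act, damping)
```

Only two inner products per gradient row are needed, which is where the memory and time savings over storing and inverting a full activation covariance come from in this simplified setting.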

Abstract

Second-order optimization methods for training neural networks, such as KFAC, exhibit superior convergence by utilizing curvature information of the loss landscape. However, this comes at the expense of a high computational burden. In this work, we analyze the two components that constitute the layer-wise Fisher information matrix (FIM) used in KFAC: the Kronecker factors related to activations and pre-activation gradients. Based on empirical observations of their eigenspectra, we propose efficient approximations for them, resulting in a computationally efficient optimization method called MAC. To the best of our knowledge, MAC is the first algorithm to apply the Kronecker factorization to the FIM of attention layers used in transformers and to explicitly integrate attention scores into the preconditioning. We also study the convergence property of MAC on nonlinear neural networks and provide two conditions under which it converges to global minima. Our extensive evaluations on various network architectures and datasets show that the proposed method outperforms KFAC and other state-of-the-art methods in terms of accuracy, end-to-end training time, and memory usage.

Paper Structure

This paper contains 31 sections, 7 theorems, 43 equations, 8 figures, 9 tables, 2 algorithms.

Key Result

Proposition 4.1

Let ${\bm{\mathrm{X}}}$ be an $m \times n$ matrix with column-wise mean vector $\bar{\bm{\mathrm{x}}} \in \mathbb{R}^n$. Define a perturbation matrix ${\bm{\mathrm{E}}}$ such that ${\bm{\mathrm{X}}} = \bm{\mathrm{1}}_m \bar{\bm{\mathrm{x}}}^\intercal + {\bm{\mathrm{E}}}$, where $\bm{\mathrm{1}}_m$ i
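The statement is cut off in this listing, so the sketch below illustrates only the decomposition defined so far, not the proposition's conclusion. It assumes $\bm{\mathrm{1}}_m$ denotes the length-$m$ all-ones vector, which the dimensions require, and it checks a standard consequence of the definition: because every column of ${\bm{\mathrm{E}}}$ sums to zero, the cross terms vanish and $\frac{1}{m}{\bm{\mathrm{X}}}^\intercal{\bm{\mathrm{X}}} = \bar{\bm{\mathrm{x}}}\bar{\bm{\mathrm{x}}}^\intercal + \frac{1}{m}{\bm{\mathrm{E}}}^\intercal{\bm{\mathrm{E}}}$. The function names and the example matrix are hypothetical.

```python
def mean_decomposition(X):
    """Split X (m rows, n columns) into its column-wise mean and the residual E.

    Returns (x_bar, E) such that X[k][j] == x_bar[j] + E[k][j].
    """
    m, n = len(X), len(X[0])
    x_bar = [sum(row[j] for row in X) / m for j in range(n)]
    E = [[row[j] - x_bar[j] for j in range(n)] for row in X]
    return x_bar, E


def scaled_gram(M):
    """Return M^T M / m as an n x n nested list, where m is the row count."""
    m, n = len(M), len(M[0])
    return [
        [sum(M[k][i] * M[k][j] for k in range(m)) / m for j in range(n)]
        for i in range(n)
    ]


# Small made-up example: 3 samples, 2 features.
X = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]
x_bar, E = mean_decomposition(X)

second_moment = scaled_gram(X)   # X^T X / m
centered_cov = scaled_gram(E)    # E^T E / m
rank1_plus_cov = [
    [x_bar[i] * x_bar[j] + centered_cov[i][j] for j in range(len(x_bar))]
    for i in range(len(x_bar))
]
# second_moment and rank1_plus_cov agree entry by entry (up to rounding).
```

When the centered term is small relative to $\bar{\bm{\mathrm{x}}}\bar{\bm{\mathrm{x}}}^\intercal$, the second moment is close to rank 1, which is the regime the mean-activation approximation relies on.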

Figures (8)

  • Figure 1: Top-50 eigenspectra of the FIM, the activation KF, and the pre-activation gradient KF in KFAC at the beginning, middle, and end of training. (Top) A linear layer in LeNet-5 and a convolutional layer in ResNet-20. (Bottom) The patch embedding (convolutional) layer and an attention layer in DeiT-Tiny as representative examples.
  • Figure 2: (Left) Cosine similarity between the top eigenvector of ${\bm{\mathrm{A}}}$ and the mean activations per layer. (Right) Comparison of centered covariance norms with squared norms of mean activation using the CIFAR-100 dataset.
  • Figure 3: Trained DeiT-Tiny on Tiny ImageNet. (Left, Center) Eigenspectra of attention scores ${\bm{\mathrm{T}}}$ from two distinct blocks as representative cases. (Right) Cosine similarity between the top eigenvector of ${\bm{\mathrm{T}}}$ and the mean attention per block.
  • Figure 4: Comparison of the convergence factor between KFAC and MAC during (Left) LeNet-5 training on Fashion MNIST and (Right) ResNet-20 training on the CIFAR-10 dataset.
  • Figure 5: Comparison of train loss and test accuracy over wall-clock time on the CIFAR-100 dataset.
  • ...and 3 more figures

Theorems & Definitions (12)

  • Proposition 4.1
  • Definition 5.2: Limiting Gram Matrix
  • Theorem 5.5: MAC
  • Remark 5.6
  • proof
  • Lemma B.1: SchurBemerkungenZT
  • Lemma B.2: Appendix D.3 in zhang2019fast
  • Lemma B.3
  • proof
  • Lemma B.4: Lemma 3.2 in du2018gradient
  • ...and 2 more