Optimizing Neural Networks with Kronecker-factored Approximate Curvature

James Martens, Roger Grosse

TL;DR

The paper addresses the cost of exact natural-gradient updates in large neural networks by introducing Kronecker-Factored Approximate Curvature (K-FAC), which approximates each layer's block of the Fisher information matrix as a Kronecker product of two much smaller matrices. The resulting surrogate $\tilde{F}$ can be inverted efficiently, and the authors develop two practical approximations of its inverse (block-diagonal and block-tridiagonal). Further contributions include online estimation of the required statistics, a damping scheme that combines a factored Tikhonov regularization with a re-scaling step using the exact Fisher, momentum integration, and invariance analyses. Experiments show substantial speedups over SGD with momentum on challenging deep autoencoder benchmarks. The work suggests practical impact for scalable second-order optimization, especially in distributed settings, and outlines future extensions to convolutional and recurrent architectures and to parallel computation strategies.

Abstract

We propose an efficient method for approximating natural gradient descent in neural networks which we call Kronecker-Factored Approximate Curvature (K-FAC). K-FAC is based on an efficiently invertible approximation of a neural network's Fisher information matrix which is neither diagonal nor low-rank, and in some cases is completely non-sparse. It is derived by approximating various large blocks of the Fisher (corresponding to entire layers) as being the Kronecker product of two much smaller matrices. While only several times more expensive to compute than the plain stochastic gradient, the updates produced by K-FAC make much more progress optimizing the objective, which results in an algorithm that can be much faster than stochastic gradient descent with momentum in practice. And unlike some previously proposed approximate natural-gradient/Newton methods which use high-quality non-diagonal curvature matrices (such as Hessian-free optimization), K-FAC works very well in highly stochastic optimization regimes. This is because the cost of storing and inverting K-FAC's approximation to the curvature matrix does not depend on the amount of data used to estimate it, which is a feature typically associated only with diagonal or low-rank approximations to the curvature matrix.
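
The abstract's central claim, that a Kronecker-factored block is cheap to invert, can be illustrated with a minimal NumPy sketch. This is not the authors' code: the names `A`, `G`, `V` and the layer sizes are assumptions chosen for illustration, and random symmetric positive-definite matrices stand in for the statistics a real implementation would estimate. The sketch relies on the standard identity $(A \otimes G)^{-1} \mathrm{vec}(V) = \mathrm{vec}(G^{-1} V A^{-\top})$ for column-stacking $\mathrm{vec}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3  # illustrative layer sizes (assumed, not from the paper)

def random_spd(n):
    """Random symmetric positive-definite matrix (a stand-in for a second-moment matrix)."""
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

A = random_spd(d_in)    # hypothetical input-side factor, d_in x d_in
G = random_spd(d_out)   # hypothetical output-side factor, d_out x d_out
V = rng.standard_normal((d_out, d_in))  # stand-in for a layer's gradient, as a matrix

# Naive route: form the full (d_in*d_out) x (d_in*d_out) block and solve against vec(V).
F_block = np.kron(A, G)
naive = np.linalg.solve(F_block, V.flatten(order="F")).reshape((d_out, d_in), order="F")

# Factored route: two small solves, never forming the Kronecker product.
# Computes G^{-1} V A^{-T}; A is symmetric, so this equals G^{-1} V A^{-1}.
factored = np.linalg.solve(A, np.linalg.solve(G, V).T).T

max_abs_diff = np.max(np.abs(naive - factored))
```

The factored route works only with the two small factors, so its cost is governed by the layer's input and output widths rather than by the full size of the block.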

Paper Structure

This paper contains 31 sections, 5 theorems, 78 equations, 11 figures, 2 algorithms.

Key Result

Theorem 1

There exists an invertible linear function $\theta = \zeta(\theta^\dagger)$ so that $f^\dagger(x,\theta^\dagger) = f(x,\theta) = f(x,\zeta(\theta^\dagger))$, and thus the transformed network can be viewed as a reparameterization of the original network by $\theta^\dagger$. Moreover, additively updat

Figures (11)

  • Figure 1: A depiction of a standard feed-forward neural network for $\ell = 2$.
  • Figure 2: A comparison of the exact Fisher $F$ and our block-wise Kronecker-factored approximation $\tilde{F}$, for the middle 4 layers of a standard deep neural network partially trained to classify a 16x16 down-scaled version of MNIST. The network was trained with 7 iterations of K-FAC in batch mode, achieving 5% error (the error reached 0% after 22 iterations). The network architecture is 256-20-20-20-20-20-10 and uses standard tanh units. On the left is the exact Fisher $F$, in the middle is our approximation $\tilde{F}$, and on the right is the difference of these. The dashed lines delineate the blocks. Note that for the purposes of visibility we plot the absolute values of the entries, with the white level corresponding linearly to the size of these values (up to some maximum, which is the same in each image).
  • Figure 3: A comparison of our block-wise Kronecker-factored approximation $\tilde{F}$, and its inverse, using the example neural network from Figure 2. On the left is $\tilde{F}$, in the middle is its exact inverse, and on the right is a 4x4 matrix containing the averages of the absolute values of the entries in each block of the inverse. As predicted by our theory, the inverse exhibits an approximate block-tridiagonal structure, whereas $\tilde{F}$ itself does not. Note that the corresponding plots for the exact $F$ and its inverse look similar. The very small blocks visible on the diagonal of the inverse each correspond to the weights on the outgoing connections of a particular unit. The inverse was computed subject to the factored Tikhonov damping technique described in Sections \ref{sec:factored_tik} and \ref{sec:gamma}, using the same value of $\gamma$ that was used by K-FAC at the iteration from which this example was taken (see Figure 2). Note that for the purposes of visibility we plot the absolute values of the entries, with the white level corresponding linearly to the size of these values (up to some maximum, which is chosen differently for the Fisher approximation and its inverse, due to the highly differing scales of these matrices).
  • Figure 4: A diagram depicting the UGGM corresponding to $\hat{F}^{-1}$ and its equivalent DGGM. The UGGM's edges are labeled with the corresponding weights of the model (these are distinct from the network's weights). Here, $(\hat{F}^{-1})_{i,j}$ denotes the $(i,j)$-th block of $\hat{F}^{-1}$. The DGGM's edges are labeled with the matrices that specify the linear mapping from the source node to the conditional mean of the destination node (whose conditional covariance is given by its label).
  • Figure 5: A comparison of our block-wise Kronecker-factored approximation $\tilde{F}$, and its approximations $\breve{F}$ and $\hat{F}$ (which are based on approximating the inverse $\tilde{F}^{-1}$ as either block-diagonal or block-tridiagonal, respectively), using the example neural network from Figure 2. On the left is $\tilde{F}$, in the middle its approximation, and on the right is the absolute difference of these. The top row compares to $\breve{F}$ and the bottom row compares to $\hat{F}$. While the diagonal blocks of the top right matrix, and the tridiagonal blocks of the bottom right matrix are exactly zero due to how $\breve{F}$ and $\hat{F}$ (resp.) are constructed, the off-tridiagonal blocks of the bottom right matrix, while being very close to zero, are actually non-zero (which is hard to see from the plot). Note that for the purposes of visibility we plot the absolute values of the entries, with the white level corresponding linearly to the size of these values (up to some maximum, which is the same in each image).
  • ...and 6 more figures
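
The right-hand panel of Figure 3 summarizes a matrix by the average absolute value of the entries in each block. A minimal NumPy sketch of that summary follows. The helper name `block_abs_means` and the toy matrices are assumptions made for illustration, not taken from the paper. The toy inverse is built to be exactly block-tridiagonal, so that the summary of the inverse has zeros away from the tridiagonal while the summary of the matrix itself is dense. This mirrors, in idealized form, the structure the caption describes.

```python
import numpy as np

def block_abs_means(M, sizes):
    """Mean absolute entry of each (i, j) block of square matrix M, partitioned by `sizes`."""
    edges = np.concatenate(([0], np.cumsum(sizes)))
    k = len(sizes)
    out = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            block = M[edges[i]:edges[i + 1], edges[j]:edges[j + 1]]
            out[i, j] = np.abs(block).mean()
    return out

sizes = [2, 2, 2, 2]  # four toy "layers" of two parameters each (assumed)

# Toy inverse with exact block-tridiagonal structure: only adjacent blocks are coupled.
# Diagonal dominance (4 on the diagonal, off-diagonal row sums at most 2) keeps it positive definite.
T = np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)  # adjacency of a 4-node chain
F_inv_toy = 4.0 * np.eye(8) + 0.5 * np.kron(T, np.ones((2, 2)))
F_toy = np.linalg.inv(F_inv_toy)  # the matrix itself is dense

inv_summary = block_abs_means(F_inv_toy, sizes)  # zeros off the block tridiagonal
fisher_summary = block_abs_means(F_toy, sizes)   # every block non-zero
```

Printing `inv_summary` and `fisher_summary` side by side gives a small numerical analogue of the 4x4 panel described in the caption.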

Theorems & Definitions (8)

  • Theorem 1
  • Corollary 2
  • Corollary 3
  • Lemma 4
  • proof : Proof of Lemma \ref{lemma:udv_expectation}
  • proof : Proof of Theorem \ref{thm:invariance}
  • Lemma 5
  • proof : Proof of Corollary \ref{cor:whitened_interpretation}