Random Walk Initialization for Training Very Deep Feedforward Networks

David Sussillo, L. F. Abbott

TL;DR

This paper addresses the vanishing/exploding gradient problem in very deep feedforward networks by treating the backpropagated gradient as a product of random matrices across layers. It develops a mathematical framework in which the log-norm of the gradient performs a random walk, and derives per-layer scaling values g that make this walk unbiased for linear and ReLU activations (with numerical guidance for tanh). The authors show that the variance of the gradient’s log-norm grows linearly with depth and is inversely proportional to layer width, implying that wider layers help stabilize training. Experiments on MNIST and TIMIT show that, with the proposed Random Walk Initialization, networks with hundreds of layers, and up to a thousand, can be trained, although depth alone does not guarantee lower training error. Practical guidance on input/output scaling and learning-rate schedules is also provided.

Abstract

Training very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. Equations for the optimal matrix scaling are provided for the linear and ReLU cases.
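
The abstract's central mechanism, a vector multiplied by a different random matrix at each layer so that the log of its norm performs a random walk, can be illustrated with a small simulation. The following is a minimal sketch in plain Python, not the authors' code. It assumes Gaussian matrix entries with variance g^2/N and uses the scaling g = exp(1/(2N)); that formula is an assumption here (it is consistent with the g = 1.005 quoted for N = 100 in the Figure 1 caption, but the formula itself is not reproduced in this summary). The function names and the small N and depth are illustrative choices to keep pure Python fast.

```python
import math
import random


def log_norm_walk(n, depth, g, rng):
    """Return [ln(|v_d| / |v_0|) for d = 1..depth] for one random walk.

    Each layer multiplies v by a freshly drawn n x n matrix whose entries
    are N(0, g^2 / n) (an assumed, illustrative choice of distribution).
    """
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    log_norm0 = 0.5 * math.log(sum(x * x for x in v))
    std = g / math.sqrt(n)
    walk = []
    for _ in range(depth):
        # One matrix-vector product; each matrix row is sampled on the fly.
        v = [sum(rng.gauss(0.0, std) * x for x in v) for _ in range(n)]
        walk.append(0.5 * math.log(sum(x * x for x in v)) - log_norm0)
    return walk


def mean_final_log_norm(n, depth, g, walks, seed=0):
    """Average ln(|v_depth| / |v_0|) over independent walks."""
    rng = random.Random(seed)
    total = sum(log_norm_walk(n, depth, g, rng)[-1] for _ in range(walks))
    return total / walks


if __name__ == "__main__":
    n, depth, walks = 16, 30, 100
    g_scaled = math.exp(1.0 / (2.0 * n))  # assumed scaling, see the lead-in
    print("mean log-norm change, g = 1:", mean_final_log_norm(n, depth, 1.0, walks))
    print("mean log-norm change, scaled g:", mean_final_log_norm(n, depth, g_scaled, walks))
```

With g = 1 the average log-norm drifts downward with depth (the norm decays), while with the scaled g the average stays close to zero, which is the sense in which the walk is approximately unbiased.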

Paper Structure

This paper contains 10 sections, 19 equations, 4 figures.

Figures (4)

  • Figure 1: Sample random walks of random vectors back-propagated through a linear network. (Top) Many samples of random walks from equation (\ref{eq:LogZDef}) with $N=100$, $D=500$ and $g=1.005$, as determined by equation (\ref{eq:gOptComput}). Both the starting vectors and all matrices were generated randomly at each step of the random walk. (Middle) The mean over all instantiations (blue) is close to zero (red line) because the optimal $g$ value was used. (Bottom) The variance of the random walks at layer $d$ (blue), and the value predicted by equation (\ref{eq:rwv}) (red).
  • Figure 2: (Top) Numerical simulation of the best $g$ as a function of $N$, from equations (\ref{eq:fp1})-(\ref{eq:bp}) with random vectors for $\boldsymbol{h}_0$ and $\boldsymbol{\delta}_D$. Black shows results of numerical simulations, and red shows the predicted best $g$ values from equations (\ref{eq:gOptComput}) and (\ref{eq:gReLU}). (Left) linear, (Middle) $\mathrm{ReLU}$, (Right) $\tanh$. (Bottom) Numerical simulation of the average $\log(|\boldsymbol{\delta}_0|/|\boldsymbol{\delta}_D|)$ as a function of $g$, again from equations (\ref{eq:fp1})-(\ref{eq:bp}). Results from equations (\ref{eq:gOptComput}) and (\ref{eq:gReLU}) are indicated by red arrows. Guidelines at 0 (solid green) and at -1 and 1 (dashed green) are provided.
  • Figure 3: Training error on MNIST as a function of $g$ and $D$. Each simulation used a parameter limit of $4 \times 10^6$. Error is shown on a $\log_{10}$ scale. The $g$ parameter is varied on the x-axis, and color denotes various values of $D$. (Upper left) Training error for the $\tanh$ function for all learning rate combinations. The learning rate hyper-parameters $\lambda_{in}$ and $\lambda_{out}$ are not visually distinguished. (Lower left) Same as upper left except showing the minimum training error for all learning rate combinations. (Upper right and lower right) Same as left, but for the $\mathrm{ReLU}$ nonlinearity. For both nonlinearities, the experimental results are in good agreement with analytical and experimental predictions.
  • Figure 4: Performance results using Random Walk Initialization on MNIST and TIMIT. Each network had the same parameter limit, regardless of depth. Training error is shown on a $\log_{2}$-$\log_{10}$ plot. The legend for color coding of $\lambda_{out}$ is shown at right. The values of $\lambda_{in}$ were varied over the same values as $\lambda_{out}$ and are shown with different markers. Several $g$ values were also used, and results were averaged over them. For A and B, $g \in \{1.05, 1.1, 1.15, 1.2\}$. (A) The classification training error on MNIST as a function of $D$, $\lambda_{in}$ and $\lambda_{out}$. (B) MNIST auto-encoder reconstruction error as a function of hyper-parameters. (C) Experiments on MNIST with $D = 1000$. Training error is shown as a function of training epoch. The hyper-parameters $\lambda_{in}$ and $\lambda_{out}$ were varied to get a sense of the difficulty of training such a deep network. The value $g = 1.05$ was used to combat pathological curvature. (D) Classification training error on the TIMIT dataset. Values of $g \in \{1.1, 1.15, 1.2, 1.25\}$ were averaged over.
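
The variance behaviour described in the abstract and plotted in the bottom panel of Figure 1 (linear growth with depth, inverse dependence on width) can be checked with a cheap sketch. It relies on a standard fact rather than on anything stated in this summary: if a matrix has i.i.d. N(0, g^2/N) entries, then for any fixed vector v the ratio |Wv|^2 / |v|^2 is distributed as g^2 times a chi-squared variable with N degrees of freedom, divided by N. Each layer's step can therefore be sampled directly, without forming matrices. The comparison value D/(2N) is a leading-order estimate derived here from Var[ln chi^2_N] ≈ 2/N; it is an assumption and is not copied from the paper's variance equation, which this summary does not reproduce. All names are illustrative.

```python
import math
import random
import statistics


def final_log_norm_ratio(n, depth, g, rng):
    """Sample ln(|v_depth| / |v_0|) for a linear chain of Gaussian matrices."""
    total = 0.0
    for _ in range(depth):
        # chi-squared with n degrees of freedom = Gamma(shape=n/2, scale=2)
        chi2 = rng.gammavariate(n / 2.0, 2.0)
        total += math.log(g) + 0.5 * math.log(chi2 / n)
    return total


def walk_variance(n, depth, walks=2000, seed=0):
    """Sample variance of the final log-norm ratio across independent walks."""
    rng = random.Random(seed)
    g = math.exp(1.0 / (2.0 * n))  # assumed scaling; only shifts the mean
    finals = [final_log_norm_ratio(n, depth, g, rng) for _ in range(walks)]
    return statistics.variance(finals)


if __name__ == "__main__":
    for n, depth in [(50, 100), (50, 200), (200, 100)]:
        estimate = depth / (2.0 * n)  # leading-order estimate, see the lead-in
        print(n, depth, round(walk_variance(n, depth), 3), "vs approx", estimate)
```

Doubling the depth at fixed width should roughly double the measured variance, and quadrupling the width at fixed depth should cut it to roughly a quarter, which is the qualitative reason wider layers are reported to help.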