Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice

Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli

TL;DR

The paper addresses how to generalize dynamical isometry to deep nonlinear networks by deriving the full singular-value spectrum of the input-output Jacobian using free probability theory. It shows that ReLU networks cannot sustain dynamical isometry, while orthogonal initialization paired with sigmoidal nonlinearities can, enabling dramatically faster learning in deep nets. Empirical results on CIFAR-10 demonstrate substantial speedups and sublinear training-time growth with depth when dynamical isometry is present, underscoring the practical importance of shaping the entire Jacobian spectrum. The work highlights a design principle: beyond controlling the mean or second moment, concentrating the Jacobian spectrum around 1 can yield tangible gains in deep learning performance.

Abstract

It is well known that the initialization of weights in deep neural networks can have a dramatic impact on learning speed. For example, ensuring the mean squared singular value of a network's input-output Jacobian is $O(1)$ is essential for avoiding the exponential vanishing or explosion of gradients. The stronger condition that all singular values of the Jacobian concentrate near $1$ is a property known as dynamical isometry. For deep linear networks, dynamical isometry can be achieved through orthogonal weight initialization and has been shown to dramatically speed up learning; however, it has remained unclear how to extend these results to the nonlinear setting. We address this question by employing powerful tools from free probability theory to compute analytically the entire singular value distribution of a deep network's input-output Jacobian. We explore the dependence of the singular value distribution on the depth of the network, the weight initialization, and the choice of nonlinearity. Intriguingly, we find that ReLU networks are incapable of dynamical isometry. On the other hand, sigmoidal networks can achieve isometry, but only with orthogonal weight initialization. Moreover, we demonstrate empirically that deep nonlinear networks achieving dynamical isometry learn orders of magnitude faster than networks that do not. Indeed, we show that properly-initialized deep sigmoidal networks consistently outperform deep ReLU networks. Overall, our analysis reveals that controlling the entire distribution of Jacobian singular values is an important design consideration in deep learning.
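The distinction between Gaussian and orthogonal initialization can be checked numerically on a small network. The following is a minimal sketch, not the authors' experimental code: it assumes a bias-free $\tanh$ network, and the width, depth, input scale `q0`, and all function names are illustrative choices. It builds the input-output Jacobian $\mathbf{J} = \prod_l \mathbf{D}^l \mathbf{W}^l$ layer by layer, where $\mathbf{D}^l$ is the diagonal matrix of $\phi'$ at the pre-activations, and returns its singular values.

```python
import numpy as np

def orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign fix so the draw is uniform over orthogonal matrices

def gaussian(n, rng):
    """i.i.d. Gaussian matrix with entry variance 1/n."""
    return rng.standard_normal((n, n)) / np.sqrt(n)

def jacobian_singular_values(depth, width, sigma_w, make_weight, rng, q0=0.01):
    """Singular values of J = prod_l D^l W^l for a bias-free tanh network.

    q0 (assumed, illustrative) sets the mean squared entry of the random input.
    """
    x = np.sqrt(q0) * rng.standard_normal(width)
    jac = np.eye(width)
    for _ in range(depth):
        w = sigma_w * make_weight(width, rng)
        h = w @ x                     # pre-activations of this layer
        x = np.tanh(h)                # post-activations
        d = 1.0 - x**2                # tanh'(h), the diagonal of D^l
        jac = (d[:, None] * w) @ jac  # left-multiply by D^l W^l
    return np.linalg.svd(jac, compute_uv=False)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for name, make_weight in [("orthogonal", orthogonal), ("gaussian", gaussian)]:
        s = jacobian_singular_values(50, 200, 1.0, make_weight, rng)
        print(f"{name:10s}  s_max={s.max():.3f}  s_min={s.min():.3e}")
```

With orthogonal weights and $\sigma_w = 1$, every factor $\mathbf{D}^l \mathbf{W}^l$ has operator norm at most $1$, so the largest singular value cannot exceed $1$; with Gaussian weights no such bound holds, and the spectrum spreads out as depth grows.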

Paper Structure

This paper contains 12 sections, 34 equations, 6 figures.

Figures (6)

  • Figure 1: Order-chaos transition when $\phi(h) = \tanh(h)$. The critical line $\chi(\sigma_w,\sigma_b)=1$ determines the boundary between two phases (Poole et al., 2016; Schoenholz et al., 2016): (a) a chaotic phase when $\chi > 1$, where forward signal propagation expands and folds space in a chaotic manner and back-propagated gradients exponentially explode, and (b) an ordered phase when $\chi < 1$, where forward signal propagation contracts space in an ordered manner and back-propagated gradients exponentially vanish. The value of $q^*$ along the critical line separating the two phases is shown as a heatmap.
  • Figure 2: Examples of deep spectra at criticality for different nonlinearities at different depths. Excellent agreement is observed between empirical simulations of networks of width 1000 (dashed lines) and theoretical predictions (solid lines). ReLU and hard tanh are with orthogonal weights, and linear is with Gaussian weights. Gaussian linear and orthogonal ReLU have similarly-shaped distributions, especially for large depths, where poor conditioning and many large singular values are observed. On the other hand, orthogonal hard tanh is much better conditioned.
  • Figure 3: The max singular value $s_{\text{max}}$ of $\mathbf{J}$ versus $L$ and $q^*$ for Gaussian (a,c) and orthogonal (b,d) weights, with ReLU (dashed) and hard-tanh (solid) networks. For Gaussian weights and for both ReLU and hard-tanh, $s_{\text{max}}$ grows with $L$ for all $q^*$ (see a,c) as predicted in eqn. \ref{eqn:ginibre_lambda_large_L}. In contrast, for orthogonal hard-tanh, but not orthogonal ReLU, at small enough $q^*$, $s_{\text{max}}$ can remain $\mathcal{O}(1)$ even at large $L$ (see b,d) as predicted in eqn. \ref{eqn:smaxorth}. In essence, at fixed small $q^*$, if $p(q^*)$ is the large fraction of neurons in the linear regime, $s_{\text{max}}$ only grows with $L$ after $L > p/(1-p)$ (see d). As $q^*\to 0$, $p(q^*)\to 1$ and the hard-tanh networks look linear. Thus the lowest curve in (a) corresponds to the prediction of linear Gaussian networks in eqn. \ref{eqn:linear_sv}, while the lowest curve in (b) is simply $1$, corresponding to linear orthogonal networks.
  • Figure 4: Learning dynamics, measured by generalization performance on a test set, for networks of depth $200$ and width $400$ trained on CIFAR-10 with different optimizers. Blue is $\tanh$ with $\sigma_w^2=1.05$, red is $\tanh$ with $\sigma_w^2 = 2$, and black is ReLU with $\sigma_w^2 = 2$. Solid lines are orthogonal and dashed lines are Gaussian initialization. The relative ordering of curves robustly persists across optimizers, and is strongly correlated with the degree to which dynamical isometry is present at initialization, as measured by $s_{\text{max}}$ in Fig. 3. Networks with $s_{\text{max}}$ closer to $1$ learn faster, even though all networks are initialized critically with $\chi=1$. The most isometric orthogonal $\tanh$ with small $\sigma_w^2$ trains several orders of magnitude faster than the least isometric ReLU network.
  • Figure 5: Empirical measurements of SGD training time $\tau$, defined as number of steps to reach $p\approx0.25$ accuracy, for orthogonal $\tanh$ networks. In (a), curves reflect different depths $L$ at fixed small $q^*=0.025$. Intriguingly, they all collapse onto a single universal curve when the learning rate $\eta$ is rescaled by $L$ and $\tau$ is rescaled by $1/\sqrt{L}$. This implies the optimal learning rate is $O(1/L)$, and remarkably, the optimal learning time $\tau$ grows only as $O(\sqrt{L})$. (b) Now different curves reflect different $q^*$ at fixed $L=200$, revealing that smaller $q^*$, associated with increased dynamical isometry in $\mathbf{J}$, enables faster training times by allowing a larger optimal learning rate $\eta$. (c) $\tau$ as a function of $L$ for a few values of $q^*$. (d) $\tau$ as a function of $q^*$ for a few values of $L$. We see qualitative agreement of (c,d) with Fig. 3(b,d), suggesting a strong connection between $\tau$ and $s_{\text{max}}$.
  • ...and 1 more figure
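
The quantities in the Figure 1 caption can be evaluated numerically. The sketch below assumes the standard mean-field recursion from the cited signal-propagation work, $q^{l} = \sigma_w^2 \int \mathcal{D}z\, \phi(\sqrt{q^{l-1}}\, z)^2 + \sigma_b^2$ with fixed point $q^*$, and $\chi = \sigma_w^2 \int \mathcal{D}z\, \phi'(\sqrt{q^*}\, z)^2$, for $\phi = \tanh$. The function names, the number of quadrature nodes, and the iteration count are illustrative assumptions; convergence of the fixed-point iteration is slow close to the critical line, so this is a rough check and not a precise solver.

```python
import numpy as np

# Gauss-Hermite nodes and weights for the weight function exp(-z^2 / 2).
_Z, _W = np.polynomial.hermite_e.hermegauss(101)
_W = _W / np.sqrt(2.0 * np.pi)  # normalize so the weights sum to 1

def gauss_mean(f):
    """Approximate E[f(z)] for z ~ N(0, 1) by Gauss-Hermite quadrature."""
    return float(np.sum(_W * f(_Z)))

def q_star(sigma_w, sigma_b, q0=1.0, iters=1000):
    """Iterate the length map q -> sigma_w^2 E[tanh(sqrt(q) z)^2] + sigma_b^2."""
    q = q0
    for _ in range(iters):
        q = sigma_w**2 * gauss_mean(lambda z: np.tanh(np.sqrt(q) * z) ** 2) + sigma_b**2
    return q

def chi(sigma_w, sigma_b):
    """chi = sigma_w^2 E[tanh'(sqrt(q*) z)^2], evaluated at the fixed point q*."""
    q = q_star(sigma_w, sigma_b)
    return sigma_w**2 * gauss_mean(lambda z: (1.0 - np.tanh(np.sqrt(q) * z) ** 2) ** 2)

if __name__ == "__main__":
    for sigma_w, sigma_b in [(0.5, 0.1), (2.0, 0.0)]:
        print(sigma_w, sigma_b, q_star(sigma_w, sigma_b), chi(sigma_w, sigma_b))
```

Values of $(\sigma_w, \sigma_b)$ with `chi < 1` fall in the ordered phase and values with `chi > 1` in the chaotic phase; scanning a grid and tracing where `chi` crosses $1$ gives an approximation to the critical line.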

Theorems & Definitions (5)

  • proof
  • Example 1
  • proof
  • Example 2
  • proof