Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice

Jeffrey Pennington; Samuel S. Schoenholz; Surya Ganguli

TL;DR

The paper extends dynamical isometry to deep nonlinear networks by using free probability theory to derive the full singular value distribution of the input-output Jacobian. It shows that ReLU networks cannot achieve dynamical isometry, whereas sigmoidal networks can, but only with orthogonal weight initialization. Experiments on CIFAR-10 show that networks with dynamical isometry learn substantially faster, with training time growing sublinearly in depth. The resulting design principle is that controlling the mean squared singular value of the Jacobian is not enough: concentrating the entire spectrum near 1 can yield tangible gains in deep learning performance.
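For reference, the quantities mentioned above can be written out as follows. This is a sketch in standard fully connected network notation; the symbols $W^l$, $b^l$, $\phi$, $D^l$, $N$ (width), $L$ (depth) and $s_i$ are introduced here for illustration and do not appear in the summary itself.

$$
h^l = W^l x^{l-1} + b^l, \qquad x^l = \phi(h^l), \qquad
J = \frac{\partial x^L}{\partial x^0} = \prod_{l=1}^{L} D^l W^l, \qquad
D^l_{ij} = \phi'(h^l_i)\,\delta_{ij}
$$

$$
\text{well-conditioned on average:}\quad \frac{1}{N}\operatorname{tr}\!\left(J J^{T}\right) = \frac{1}{N}\sum_{i=1}^{N} s_i^2 = O(1),
\qquad
\text{dynamical isometry:}\quad s_i \approx 1 \ \text{for all } i
$$

The first condition constrains only the second moment of the singular values $s_i$ of $J$; the second constrains the whole distribution.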

Abstract

It is well known that the initialization of weights in deep neural networks can have a dramatic impact on learning speed. For example, ensuring the mean squared singular value of a network's input-output Jacobian is $O(1)$ is essential for avoiding the exponential vanishing or explosion of gradients. The stronger condition that all singular values of the Jacobian concentrate near $1$ is a property known as dynamical isometry. For deep linear networks, dynamical isometry can be achieved through orthogonal weight initialization and has been shown to dramatically speed up learning; however, it has remained unclear how to extend these results to the nonlinear setting. We address this question by employing powerful tools from free probability theory to compute analytically the entire singular value distribution of a deep network's input-output Jacobian. We explore the dependence of the singular value distribution on the depth of the network, the weight initialization, and the choice of nonlinearity. Intriguingly, we find that ReLU networks are incapable of dynamical isometry. On the other hand, sigmoidal networks can achieve isometry, but only with orthogonal weight initialization. Moreover, we demonstrate empirically that deep nonlinear networks achieving dynamical isometry learn orders of magnitude faster than networks that do not. Indeed, we show that properly-initialized deep sigmoidal networks consistently outperform deep ReLU networks. Overall, our analysis reveals that controlling the entire distribution of Jacobian singular values is an important design consideration in deep learning.
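The contrast described in the abstract can be checked numerically on a small scale. The sketch below is a brute-force check with NumPy, not the paper's free-probability calculation or its experimental setup. The helper names, the width (128), depth (32), zero biases, input scale `q0`, and the weight scales `sigma_w` are all assumptions made for illustration. It builds the Jacobian as a product of per-layer factors $D^l W^l$ and compares singular values for three configurations.

```python
import numpy as np

def orthogonal(n, rng):
    """Random n x n orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign fix makes the draw uniform (Haar)

def gaussian(n, rng):
    """Gaussian matrix with i.i.d. entries of variance 1/n."""
    return rng.standard_normal((n, n)) / np.sqrt(n)

def jacobian_singular_values(init, phi, dphi, sigma_w,
                             depth=32, width=128, q0=1e-3, seed=0):
    """Singular values of the input-output Jacobian of a bias-free network.

    All default values here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x = np.sqrt(q0) * rng.standard_normal(width)  # small input, per-unit variance q0
    J = np.eye(width)
    for _ in range(depth):
        W = sigma_w * init(width, rng)
        h = W @ x                         # pre-activations of this layer
        J = (dphi(h)[:, None] * W) @ J    # left-multiply by D W, D = diag(phi'(h))
        x = phi(h)
    return np.linalg.svd(J, compute_uv=False)

def tanh_d(h):
    return 1.0 - np.tanh(h) ** 2

def relu(h):
    return np.maximum(h, 0.0)

def relu_d(h):
    return (h > 0).astype(float)

results = {
    "tanh + orthogonal": jacobian_singular_values(orthogonal, np.tanh, tanh_d, 1.0),
    "tanh + Gaussian": jacobian_singular_values(gaussian, np.tanh, tanh_d, 1.0),
    "ReLU + orthogonal": jacobian_singular_values(orthogonal, relu, relu_d, np.sqrt(2.0)),
}

for name, s in results.items():
    print(f"{name:18s} min={s.min():.3e}  max={s.max():.3e}")
```

In this toy setting the tanh network with orthogonal weights operates close to its linear regime, so its singular values stay near 1. The Gaussian-initialized network spreads them over many orders of magnitude. The ReLU network has many singular values at zero, because each layer zeroes the rows of inactive units. The choice of a small `q0` with `sigma_w = 1` is a convenient assumption, not a tuned critical initialization.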

Paper Structure

Table of Contents

Figures (6)

Theorems & Definitions (5)