The Emergence of Spectral Universality in Deep Networks

Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli

TL;DR

The paper develops a free-probability framework to exactly characterize the full input-output Jacobian spectrum of deep networks at initialization. By deriving a master equation that ties the spectrum to transforms of the nonlinearity and the weight ensemble, it reveals how depth, nonlinearities, and initialization jointly shape spectral concentration around unity. It uncovers two universal limiting spectral laws—Bernoulli-like and smooth—emerging under a double-scaling regime with orthogonal weights, and shows that orthogonality is essential for stable universality. These results provide principled guidance for choosing nonlinearities and weight preparations to achieve dynamical isometry and fast learning in very deep networks.

Abstract

Recent work has shown that tight concentration of the entire spectrum of singular values of a deep network's input-output Jacobian around one at initialization can speed up learning by orders of magnitude. Therefore, to guide important design choices, it is important to build a full theoretical understanding of the spectra of Jacobians at initialization. To this end, we leverage powerful tools from free probability theory to provide a detailed analytic understanding of how a deep network's Jacobian spectrum depends on various hyperparameters including the nonlinearity, the weight and bias distributions, and the depth. For a variety of nonlinearities, our work reveals the emergence of new universal limiting spectral distributions that remain concentrated around one even as the depth goes to infinity.
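
As a rough illustration of the quantity the abstract refers to, the sketch below builds the input-output Jacobian of a randomly initialized fully connected network and returns its singular values. It is not the paper's code: the function names, the bias-free layers, the random Gaussian input, and the choice of width and depth are all assumptions made for this example.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Sample an n x n orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Fixing the signs of the columns makes the sample Haar-distributed.
    return q * np.sign(np.diag(r))


def jacobian_singular_values(depth, width, phi, dphi, sigma_w, orthogonal, seed=0):
    """Singular values of J = prod_l D_l W_l at initialization.

    Assumptions for this sketch: no biases, a standard Gaussian input,
    and D_l = diag(phi'(h_l)) with h_l = W_l x_{l-1}.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    J = np.eye(width)
    for _ in range(depth):
        if orthogonal:
            W = sigma_w * random_orthogonal(width, rng)
        else:
            # Gaussian weights with entry variance sigma_w^2 / width.
            W = sigma_w * rng.standard_normal((width, width)) / np.sqrt(width)
        h = W @ x
        # Left-multiply the running Jacobian by D_l W_l.
        J = (dphi(h)[:, None] * W) @ J
        x = phi(h)
    return np.linalg.svd(J, compute_uv=False)


def identity(h):
    return h


def identity_deriv(h):
    return np.ones_like(h)


if __name__ == "__main__":
    for orthogonal in (True, False):
        s = jacobian_singular_values(8, 64, identity, identity_deriv, 1.0, orthogonal)
        label = "orthogonal" if orthogonal else "Gaussian"
        print(f"linear {label}: min={s.min():.3g}, max={s.max():.3g}")
```

In the linear case with orthogonal weights and $\sigma_w = 1$, the product of orthogonal matrices is itself orthogonal, so every singular value equals one. With Gaussian weights the singular values spread out as depth grows. Passing `np.tanh` and its derivative in place of the identity gives a nonlinear variant of the same experiment.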

Paper Structure

This paper contains 34 sections, 102 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Order-chaos transition when $\phi(h) = \tanh(h)$. The critical line $\chi=1$ determines the boundary between the two phases. In the chaotic regime, $\chi>1$ and gradients explode, while in the ordered regime, $\chi<1$ and we expect gradients to vanish. The value of $q^*$ along this line is shown as a heatmap.
  • Figure 2: Examples of deep spectra at criticality for different nonlinearities at different depths. Singular values from empirical simulations of networks of width 1000 are shown with solid lines while theoretical predictions from the master equation and algorithm are overlaid with dashed lines. For each panel, the weight variance $\sigma_w^2$ is held constant as the depth increases. Notice that linear Gaussian and orthogonal ReLU have similarly-shaped distributions, especially for large depths, where poor conditioning and many large singular values are observed. Erf and Hard Tanh are better conditioned, but at 128 layers we begin to observe some spread in the distributions.
  • Figure 3: Distribution of $\phi'(h)$ for different nonlinearities. The top row shows the nonlinearity, $\phi(h)$, along with the Gaussian distribution of pre-activations $h$ for four different choices of the variance, $q^*$. The bottom row gives the induced distribution of $\phi'(h)$. We see that for ReLU the distribution is independent of $q^*$. This implies that there is no stable limiting distribution for the spectrum of $\mathbf{JJ}^T$. By contrast, for the other nonlinearities the distribution is a relatively strong function of $q^*$.
  • Figure 4: Two limiting universality classes of Jacobian spectra. Hard Tanh and Shifted ReLU fall into one class, characterized by Bernoulli-distributed $\phi'(h)^2$, while Erf and Smoothed ReLU fall into a second class, characterized by a smooth distribution for $\phi'(h)^2$. The black curves are theoretical predictions for the limiting distributions with variance $\sigma_0^2 = 1/4$. The colored lines are empirical spectra of finite-depth width-1000 orthogonal neural networks. The empirical spectra converge to the limiting distributions in all cases. The rate of convergence is similar for Hard Tanh and Shifted ReLU, whereas it is significantly different for Erf and Smoothed ReLU, which converge to the same limiting distribution along distinct trajectories. In all cases, the solid colored lines go from shallow $L=2$ networks (red) to deep networks (purple). In all cases but Erf, the deepest networks have $L=128$. For Erf, the dashed lines show solutions to \ref{eqn:Geqn_arb} for very large depth up to $L = 8192$.
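
The captions for Figures 3 and 4 can be checked numerically with a small Monte Carlo sketch. It samples pre-activations $h \sim \mathcal{N}(0, q^*)$ and estimates the mean of $\phi'(h)^2$ for ReLU and Hard Tanh. The helper names, the sample size, and the definition of Hard Tanh as clipping to $[-1, 1]$ are assumptions made for this example, not taken from the paper.

```python
import math

import numpy as np


def relu_deriv(h):
    # phi'(h) for ReLU: 1 where h > 0, else 0.
    return (h > 0).astype(float)


def hard_tanh_deriv(h):
    # phi'(h) for Hard Tanh, assumed here to be clip(h, -1, 1).
    return (np.abs(h) < 1).astype(float)


def mean_sq_deriv(dphi, q_star, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[phi'(h)^2] for h ~ N(0, q_star)."""
    rng = np.random.default_rng(seed)
    h = math.sqrt(q_star) * rng.standard_normal(n_samples)
    return float(np.mean(dphi(h) ** 2))


def hard_tanh_exact(q_star):
    # P(|h| < 1) for h ~ N(0, q_star), in closed form.
    return math.erf(1.0 / math.sqrt(2.0 * q_star))


if __name__ == "__main__":
    for q in (0.1, 0.5, 1.0, 2.0):
        p_relu = mean_sq_deriv(relu_deriv, q)
        p_ht = mean_sq_deriv(hard_tanh_deriv, q)
        print(f"q*={q}: ReLU {p_relu:.3f}, Hard Tanh {p_ht:.3f} "
              f"(closed form {hard_tanh_exact(q):.3f})")
```

Both derivatives take only the values 0 and 1, so $\phi'(h)^2$ is Bernoulli in each case. For ReLU the success probability is $1/2$ whatever the value of $q^*$, which matches the Figure 3 caption. For Hard Tanh it equals $\operatorname{erf}(1/\sqrt{2 q^*})$ and therefore moves with $q^*$.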