Exponential expressivity in deep neural networks through transient chaos

Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, Surya Ganguli

TL;DR

The paper studies how depth affects neural-network expressivity by exposing an order-to-chaos transition in signal propagation through random deep networks. It develops a framework that combines mean field theory with Riemannian geometry and tracks length and curvature across layers via an iterative length map for $q^l$ and a correlation map $\mathcal{C}$, predicting that global curvature grows exponentially with depth in the chaotic regime. It shows that deep networks can disentangle curved input manifolds into flat hidden representations, and that shallow networks cannot efficiently match this expressivity. The result is a quantitative null model and a geometric lens for understanding deep-function expressivity across arbitrary nonlinearities.

Abstract

We combine Riemannian geometry with the mean field theory of high dimensional chaos to study the nature of signal propagation in generic, deep neural networks with random weights. Our results reveal an order-to-chaos expressivity phase transition, with networks in the chaotic phase computing nonlinear functions whose global curvature grows exponentially with depth but not width. We prove this generic class of deep random functions cannot be efficiently computed by any shallow network, going beyond prior work restricted to the analysis of single functions. Moreover, we formalize and quantitatively demonstrate the long conjectured idea that deep networks can disentangle highly curved manifolds in input space into flat manifolds in hidden space. Our theoretical analysis of the expressive power of deep networks broadly applies to arbitrary nonlinearities, and provides a quantitative underpinning for previously abstract notions about the geometry of deep functions.

Paper Structure

This paper contains 21 sections, 1 theorem, 35 equations, 6 figures.

Key Result

Theorem 1

Suppose $\phi(h)$ is monotonically non-decreasing with bounded dynamic range $R$, i.e. $\max_h \phi(h) - \min_h \phi(h) = R$. Further suppose that $\mathbf{x}^0(\theta)$ is a curve in input space such that no 1D projection of $\partial_\theta \mathbf{x}(\theta)$ changes sign more than $s$ times ove

Figures (6)

  • Figure 1: Dynamics of the squared length $q^l$ for a sigmoidal network ($\phi(h) = \tanh(h)$) with 1000 hidden units. (A) The iterative length map in Eq. \ref{eq:qliter} for 3 different $\sigma_w$ at $\sigma_b=0.3$. Theoretical predictions (solid lines) match well with individual network simulations (dots). Stars mark the fixed points $q^*$ of the map. (B) The iterative dynamics of the length map yield rapid convergence of $q^l$ to its fixed point $q^*$, independent of initial condition (lines=theory; dots=simulation). (C) $q^*$ as a function of $\sigma_w$ and $\sigma_b$. (D) Number of iterations required to achieve $\leq$ 1% fractional deviation from the fixed point. The $(\sigma_b,\sigma_w)$ pairs in (A,B) are marked with color-matched circles in (C,D).
  • Figure 2: Dynamics of correlations, $c_{12}^l$, in a sigmoidal network with $\phi(h)=\tanh(h)$. (A) The $\mathcal{C}$-map in Eq. \ref{eq:cciter} for the same $\sigma_w$ and $\sigma_b=0.3$ as in Fig. \ref{fig:qmap}A. (B) The $\mathcal{C}$-map dynamics, derived from both theory, through Eq. \ref{eq:cciter} (solid lines), and numerical simulations of Eq. \ref{eq:netdynam} with $N_l = 1000$ (dots). (C) Fixed points $c^*$ of the $\mathcal{C}$-map. (D) The slope of the $\mathcal{C}$-map at $1$, $\chi_1$, partitions the space (black dotted line at $\chi_1=1$) into chaotic ($\chi_1 > 1$, $c^* < 1$) and ordered ($\chi_1 < 1$, $c^* = 1$) regions.
  • Figure 3: Propagating a circle through three random sigmoidal networks with varying $\sigma_w$ and fixed $\sigma_b=0.3$. (A) Projection of hidden inputs of simulated networks at layers 5 and 10 onto their first three principal components. Insets show the fraction of variance explained by the first 5 singular values. For large weights (bottom), the distribution of singular values gets flatter and the projected curve is more tangled. (B) The autocorrelation, $c^l_{12}(\Delta \theta) = \int d\theta \, q^l(\theta, \theta+\Delta \theta)/q^*$, of hidden inputs as a function of layer for simulated networks. (C) The theoretical predictions from Eq. \ref{eq:cciter} (solid lines) compared to the average (dots) and standard deviation across $\theta$ (shaded) in a simulated network.
  • Figure 4: Propagation of extrinsic curvature and length in a network with 1000 hidden units. (A) An osculating circle. (B) A curve with unit tangent vectors at 4 points in ambient space, and the image of these points under the Gauss map. (C-E) Propagation of curvature metrics based on both theory derived from the iterative maps in Eqs. \ref{eq:qliter}, \ref{eq:cciter} and \ref{eq:kiter} (solid lines) and simulations using Eq. \ref{eq:netdynam} (dots). (F) Schematic of the normal vector, tangent plane, and principal curvatures for a 2D manifold embedded in $\mathbb R^3$. (G) Average principal curvatures for the largest and smallest 4 principal curvatures ($\kappa_{\pm 1}, \dots , \kappa_{\pm 4}$) across locations $\theta$ within one network. The principal curvatures all grow exponentially as we backpropagate to the input layer. Panels F,G are discussed in Sec. 5.
  • Figure 5: Deep networks in the chaotic regime are more expressive than shallow networks. (A) Activity of four different neurons in the output layer as a function of the input, $\theta$, for three networks of different depth (width $N_l = 1{,}000$). (B) Linear regression of the output activity onto a random function (black) shows closer predictions (blue) with deeper networks (bottom) than with shallow networks (top). (C) Decomposing the prediction error by frequency shows that shallow networks cannot capture high-frequency content in random functions but deep networks can (yellow=high error). (D) Increasing the width of a one-hidden-layer network up to $10{,}000$ does not decrease error at high frequencies.
  • ...and 1 more figure
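
The quantities in the captions of Figures 1 and 2 (the fixed point $q^*$ of the length map and the slope $\chi_1$ of the $\mathcal{C}$-map at $1$) can be explored numerically. The sketch below is illustrative only: the excerpt above does not reproduce the equations, so it assumes the usual mean-field forms $q^l = \sigma_w^2 \int \mathcal{D}z\, \phi(\sqrt{q^{l-1}}\, z)^2 + \sigma_b^2$ and $\chi_1 = \sigma_w^2 \int \mathcal{D}z\, [\phi'(\sqrt{q^*}\, z)]^2$ with $\phi = \tanh$, and evaluates the Gaussian integrals by Gauss-Hermite quadrature. All function names (`gauss_mean`, `length_map`, `fixed_point`, `chi_1`) are hypothetical.

```python
import numpy as np

# Probabilists' Gauss-Hermite nodes and weights, used to approximate
# E[f(z)] for z ~ N(0, 1). The raw weights sum to sqrt(2*pi), so normalize.
_Z, _W = np.polynomial.hermite_e.hermegauss(80)
_W = _W / _W.sum()


def gauss_mean(f):
    """Approximate E[f(z)] for a standard normal z by quadrature."""
    return float(np.sum(_W * f(_Z)))


def length_map(q, sigma_w, sigma_b):
    """One step of the assumed length map for phi = tanh."""
    return sigma_w**2 * gauss_mean(lambda z: np.tanh(np.sqrt(q) * z) ** 2) + sigma_b**2


def fixed_point(sigma_w, sigma_b, q0=1.0, tol=1e-10, max_iter=10_000):
    """Iterate the length map from q0 until successive values agree to tol."""
    q = q0
    for _ in range(max_iter):
        q_next = length_map(q, sigma_w, sigma_b)
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q


def chi_1(sigma_w, sigma_b):
    """Assumed slope of the C-map at c = 1, evaluated at the fixed point q*."""
    q_star = fixed_point(sigma_w, sigma_b)
    dphi = lambda h: 1.0 - np.tanh(h) ** 2  # derivative of tanh
    return sigma_w**2 * gauss_mean(lambda z: dphi(np.sqrt(q_star) * z) ** 2)


if __name__ == "__main__":
    for sigma_w in (0.5, 4.0):
        chi = chi_1(sigma_w, 0.3)
        regime = "chaotic" if chi > 1 else "ordered"
        print(f"sigma_w={sigma_w}: q*={fixed_point(sigma_w, 0.3):.4f}, chi_1={chi:.4f} ({regime})")
```

Under these assumptions, a small weight scale such as $\sigma_w = 0.5$ at $\sigma_b = 0.3$ gives $\chi_1 < 1$ (ordered), while a large one such as $\sigma_w = 4.0$ gives $\chi_1 > 1$ (chaotic), matching the partition described in Figure 2D.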

Theorems & Definitions (1)

  • Theorem 1