Depth Degeneracy in Neural Networks: Vanishing Angles in Fully Connected ReLU Networks on Initialization

Cameron Jakub; Mihai Nica

TL;DR

This work analyzes depth degeneracy in fully connected ReLU networks at initialization by tracking how the angle between two inputs evolves with depth. It develops a finite-width theory that captures layerwise fluctuations through mixed Gaussian moments $J_{a,b}(\theta)$ and derives a mean/variance description for $\ln(\sin^2(\theta_{\ell+1}))$; crucially, these results differ from the infinite-width limit by incorporating a width-dependent correction $\rho(n)$ and nonzero variance. The authors introduce a Gaussian-IBP framework to compute $J_{a,b}(\theta)$, reveal a combinatorial connection to the Bessel numbers via $P(a,b), Q(a,b)$, and provide explicit closed forms that enable accurate finite-width predictions. These results are validated through Monte Carlo simulations and applied to neural architecture search-like experiments, showing that smaller predicted angles correlate with poorer training outcomes and offering a practical tool to screen architectures before training. The study highlights the importance of finite-width fluctuations in deep networks and provides a path to extend these methods to other nonlinearities and architectures.

Abstract

Despite remarkable performance on a variety of tasks, many properties of deep neural networks are not yet theoretically understood. One such mystery is the depth degeneracy phenomenon: the deeper you make your network, the closer your network is to a constant function on initialization. In this paper, we examine the evolution of the angle between two inputs to a ReLU neural network as a function of the number of layers. By using combinatorial expansions, we find precise formulas for how fast this angle goes to zero as depth increases. These formulas capture microscopic fluctuations that are not visible in the popular framework of infinite width limits, and lead to qualitatively different predictions. We validate our theoretical results with Monte Carlo experiments and show that our results accurately approximate finite network behaviour. We also empirically investigate how the depth degeneracy phenomenon can negatively impact training of real networks. The formulas are given in terms of the mixed moments of correlated Gaussians passed through the ReLU function. We also find a surprising combinatorial connection between these mixed moments and the Bessel numbers that allows us to explicitly evaluate these moments.

Paper Structure (30 sections, 18 theorems, 108 equations, 7 figures, 10 tables, 1 algorithm)

Introduction
Main Results for the Angle Process $\theta_\ell$
Theoretical Consequences and Comparison to Previous Work
More detailed results for the mean and variance
Practical Consequences: Depth Degeneracy Negatively Impacts Training
Comparison to Infinite Width Update Rule
J Functions and Infinite Width Limits
Outline
ReLU Neural Networks on Initialization
Expected Value
Variance of $\ln(\sin^2(\theta_{\ell+1}))$
Explicit Formula for the Mixed-Moment J Functions
Statement of Main Results and Outline of Method
Gaussian Integration-by-Parts Formulas
Recursive Formulas for $J_{a,b}(\theta)$ - Proof of Proposition 1
...and 15 more sections

Key Result

Theorem 1

Conditionally on the angle $\theta_\ell$ in layer $\ell$ (see Table \ref{tbl:notation} for a precise definition of all the notation), the mean and variance of $\ln \sin^2 (\theta_{\ell+1})$ obey the following limit as the layer width $n_\ell \to \infty$, where $J_{a,b} := J_{a,b}(\theta_\ell)$ are the joint moments of correlated Gaussians passed through the ReLU function $\varphi(x)=\max\{x,0\}$, namely
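
The joint moments named in the theorem can be checked numerically. The following is a minimal sketch, assuming that $J_{a,b}(\theta) = \mathbf{E}[\varphi(G)^a \varphi(\hat{G})^b]$ for standard Gaussians $G, \hat{G}$ with correlation $\cos(\theta)$, and that a zeroth power is read as the indicator $\mathbf{1}\{x > 0\}$; both conventions are assumptions here and should be checked against the paper's notation table. The function names are hypothetical. The estimate is compared against the well-known closed form for the $a=b=1$ case (the degree-1 arc-cosine kernel), $J_{1,1}(\theta) = \frac{\sin\theta + (\pi - \theta)\cos\theta}{2\pi}$.

```python
import numpy as np

def relu_power(x, p):
    """ReLU raised to the power p.

    Assumption: p = 0 is interpreted as the indicator 1{x > 0}.
    """
    if p == 0:
        return (x > 0).astype(float)
    return np.maximum(x, 0.0) ** p

def j_monte_carlo(a, b, theta, n_samples=1_000_000, seed=0):
    """Monte Carlo estimate of E[relu(G)^a * relu(G_hat)^b],
    where G, G_hat are standard Gaussians with correlation cos(theta)."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(n_samples)
    z = rng.standard_normal(n_samples)
    # Build a Gaussian with correlation cos(theta) to g from an independent z.
    g_hat = np.cos(theta) * g + np.sin(theta) * z
    return float(np.mean(relu_power(g, a) * relu_power(g_hat, b)))

def j11_closed_form(theta):
    """Closed form for the a = b = 1 moment (degree-1 arc-cosine kernel)."""
    return (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2.0 * np.pi)

if __name__ == "__main__":
    theta = 0.5
    print("Monte Carlo J_11:", j_monte_carlo(1, 1, theta))
    print("Closed form J_11:", j11_closed_form(theta))
```

Sampling is only a sanity check; the explicit formulas developed in the paper are what make the moments cheap to evaluate for general $a, b$.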

Figures (7)

Figure 1: We feed 2 inputs with initial angle $\theta_0 = 0.1$ into 5000 Monte Carlo samples of independently initialized networks with network width $n_\ell = 256$ for all layers. Left: Using the Monte Carlo samples, we plot the empirical mean and standard deviation of $\ln(\sin^2(\theta_\ell))$ at each layer. We compare this to both the infinite width update rule and our prediction using Approximation 1 for the mean of $\ln(\sin^2(\theta_\ell))$ (shown as the blue squares). Our prediction for the standard deviation in each layer using Approximation 2 is also plotted as the shaded area. To compute this, we iterate Approximation 2 to estimate the PDF in each layer, and then compute the variance using the PDF. In contrast to our prediction, the infinite width rule predicts 0 variance in all layers. Right: We plot histograms of our simulations as well as our predicted probability density function using Approximation 2 from \ref{eq:update_full} at Layer 1 (top) and Layer 30 (bottom). The predicted PDF is computed numerically by iterating Approximation 2 over the 30 layers, using the PDF in layer $\ell$ to get the PDF in layer $\ell+1$. The predicted and empirical distributions are statistically indistinguishable according to a Kolmogorov-Smirnov test, with $p$ values $0.987 > 0.05$ (top) and $0.186 > 0.05$ (bottom). The code which produced this figure can be found at https://github.com/camjakub/Depth-Degeneracy-in-Neural-Networks.
Figure 2: Plots comparing the functions $\mu(\theta, n)$ and $\sigma^2(\theta,n)$ to simulated neural networks. The linear approximation of $\mu$, used to create Approximation 1, is also displayed. Confidence bands are constructed by randomly initializing 10,000 neural networks with layer width $n_\ell=1024$, and a range of 100 initial angles $0.005 \leq \theta_\ell \leq 0.8$. We study $\theta_{\ell+1}$ and use the simulations to construct 99% confidence intervals for a) $\mathbf{E}\left[ \ln(\sin^2(\theta_\ell)) - \ln(\sin^2(\theta_{\ell+1}))\right]$ and b) $\mathbf{Var}\left[ \ln(\sin^2(\theta_{\ell+1}))\right]$.
Figure 3: We compare 45 different network architectures trained on the MNIST \cite{mnist}, Fashion-MNIST \cite{fmnist}, and CIFAR-10 \cite{cifar} datasets 10 times each. Using the architecture of the network and Algorithm 1, we predict the angle between 2 orthogonal inputs at the final output layer of the network on initialization. We express the angle as $\ln(\sin^2(\theta_L))$, to follow the form used when developing the finite width approximations. The angle is plotted against the accuracy of each network on the test data after training, with error bars representing a 95% confidence interval across the 10 runs. We observe that a small angle $\theta_L$ is related to lower test accuracy. All networks are trained for 1 epoch with batch size $=100$, categorical cross-entropy loss, the ADAM optimizer, and the default learning rate in the Keras module of TensorFlow \cite{tensorflow}. See Appendix \ref{app:network_architectures} for details on all of the network architectures used. The code which produced this figure can be found at https://github.com/camjakub/Depth-Degeneracy-in-Neural-Networks.
Figure 4: Left: Comparison of the finite and infinite width predictions for 5 network architectures with a depth of $L=3$ trained 10 times each on the CIFAR-10 dataset \cite{cifar}. The infinite width rule predicts the same final angle for all networks, since it only depends on network depth. Right: Using the same 45 network architectures as in Figure 3, we plot a comparison of the predicted angle $\theta_L$ using Algorithm 1 (finite width) versus the infinite width prediction. We see that the infinite width prediction tends to underestimate the rate at which $\theta_\ell$ tends towards 0.
Figure 5: The graph associated with the recursions for $J$ in \ref{eq:J_rec} (left) and $J^\ast$ in \ref{newrecursion} (right). The graph is defined so that the recursion is given by a sum of incoming edges as in \ref{eq:graph}. The edges are color coded red and blue to match the coefficients in the recursion.
...and 2 more figures
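
The Monte Carlo setup described in the Figure 1 caption can be sketched roughly as follows. This is an illustrative sketch, not the authors' code (which is linked in the captions). It assumes square layers with i.i.d. $\mathcal{N}(0, 2/n)$ weights, no biases, an input dimension equal to the width, and that $\theta_\ell$ is measured between the post-activation vectors; all of these choices, and the function names, are assumptions. The infinite width comparison uses the standard deterministic update $\cos(\theta_{\ell+1}) = \frac{\sin\theta_\ell + (\pi - \theta_\ell)\cos\theta_\ell}{\pi}$.

```python
import numpy as np

def angle_between(u, v):
    """Angle in [0, pi] between two nonzero vectors."""
    cos_uv = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos_uv, -1.0, 1.0)))

def simulate_angles(theta0, width, depth, rng):
    """Track the angle between two inputs through one randomly
    initialized ReLU network (assumed: N(0, 2/width) weights, no biases)."""
    x = np.zeros(width)
    y = np.zeros(width)
    x[0] = 1.0
    y[0], y[1] = np.cos(theta0), np.sin(theta0)  # unit vector at angle theta0 to x
    angles = [theta0]
    for _ in range(depth):
        w = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        x, y = np.maximum(w @ x, 0.0), np.maximum(w @ y, 0.0)
        angles.append(angle_between(x, y))
    return np.array(angles)

def infinite_width_angles(theta0, depth):
    """Deterministic infinite width update rule for the angle."""
    angles = [theta0]
    for _ in range(depth):
        t = angles[-1]
        cos_next = (np.sin(t) + (np.pi - t) * np.cos(t)) / np.pi
        angles.append(float(np.arccos(np.clip(cos_next, -1.0, 1.0))))
    return np.array(angles)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta0, width, depth, n_nets = 0.1, 256, 30, 200
    runs = np.stack([simulate_angles(theta0, width, depth, rng) for _ in range(n_nets)])
    log_sin2 = np.log(np.sin(runs) ** 2)
    inf = np.log(np.sin(infinite_width_angles(theta0, depth)) ** 2)
    print("empirical mean at final layer:", log_sin2[:, -1].mean())
    print("empirical std at final layer: ", log_sin2[:, -1].std())
    print("infinite width prediction:    ", inf[-1])
```

The infinite width rule returns a single trajectory with no spread, which is why the empirical standard deviation has no counterpart in that framework.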

Theorems & Definitions (27)

Theorem 1: Formula for mean and variance in terms of J functions
Corollary 2: Small $\theta$ asymptotics for mean and variance
Proposition 1: Recurrence relations for $J_{a,b}$
Proposition 2: Explicit Formula for $J_{0,0},J_{1,0},J_{1,1}$
Proposition 3: Explicit Formulas for $J_{a,0}(\theta)$, $J_{a,1}(\theta)$
Definition 3: Bessel numbers
Theorem 4: Explicit Formula for $J_{a,b}(\theta)$
Remark 5
Remark 6
Remark 7
...and 17 more
