On the Stability of the Jacobian Matrix in Deep Neural Networks

Benjamin Dadoun, Soufiane Hayou, Hanan Salam, Mohamed El Amine Seddik, Pierre Youssef

TL;DR

Deep neural networks often suffer from vanishing or exploding gradients, which are governed by the spectral behavior of the input-output Jacobian $J_1=\prod_{l=1}^L W_l D_{l-1}$. The paper develops a general stability theorem for products of random matrices that covers sparse and non-i.i.d., weakly correlated weights. Using random matrix theory and asymptotic freeness, it shows that $\|J_k^B\|$ converges to the norm of a product of semicircular elements $\|s_k\cdots s_L\|$. It then specializes to sparse networks, demonstrating that appropriate weight scaling after pruning (e.g., by $(1-s_n)^{-1/2}$) preserves stability and that an edge of stability emerges at high sparsity; similar scaling-based stability holds for score-based pruning with method-dependent factors. The paper also proves that dependent weights with sufficiently small width-dependent correlations retain stability in expectation, and it validates the theory empirically, including independence tests for diagonal activations, effects of scaling on trainability, and the behavior under correlated weights. Overall, this work broadens the theoretical foundation for initializing and pruning modern neural networks, offering practical guidance for maintaining Jacobian stability beyond fully i.i.d. weight initialization.
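
To see where a factor such as $(1-s_n)^{-1/2}$ can come from, consider the simplest case, assumed here purely for illustration: each weight $W^{ij} \sim \mathcal{N}(0,\sigma_w^2/n)$ is kept independently with probability $1-s_n$ by a Bernoulli mask $B^{ij}$ (independent of $W^{ij}$) and then multiplied by a constant $c > 0$. The symbols $B^{ij}$ and $c$ are notation introduced for this sketch. Matching the entry variance to the dense case gives

$$\mathrm{Var}\left(c\,B^{ij}W^{ij}\right) = c^2\,\mathbb{E}\left[(B^{ij})^2\right]\mathbb{E}\left[(W^{ij})^2\right] = c^2\,(1-s_n)\,\frac{\sigma_w^2}{n} = \frac{\sigma_w^2}{n} \iff c = (1-s_n)^{-1/2}.$$

This is a variance-matching heuristic only, not the paper's argument, which works with the norm of the full matrix product.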

Abstract

Deep neural networks are known to suffer from exploding or vanishing gradients as depth increases, a phenomenon closely tied to the spectral behavior of the input-output Jacobian. Prior work has identified critical initialization schemes that ensure Jacobian stability, but these analyses are typically restricted to fully connected networks with i.i.d. weights. In this work, we go significantly beyond these limitations: we establish a general stability theorem for deep neural networks that accommodates sparsity (such as that introduced by pruning) and non-i.i.d., weakly correlated weights (e.g. induced by training). Our results rely on recent advances in random matrix theory, and provide rigorous guarantees for spectral stability in a much broader class of network models. This extends the theoretical foundation for initialization schemes in modern neural networks with structured and dependent randomness.

Paper Structure

This paper contains 12 sections, 7 theorems, 57 equations, 9 figures.

Key Result

theorem 1

Assume that the weights are initialized i.i.d. as $W_k^{ij} \sim \mathcal{N}(0,\sigma_w^2/n)$ for some $\sigma_w > 0$. Then, in the limit $n \to \infty$ and under the i.i.d. Bernoulli approximation, the choice $\sigma_w^2 = 2$ guarantees stability.
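
A minimal NumPy sketch of this statement, under assumptions not taken from the paper: a bias-free ReLU MLP, a standard Gaussian vector standing in for the input, and the spectral norm used as the "Jacobian norm". The function name `jacobian_norm` and its parameters are hypothetical. It accumulates $J = \prod_l W_l D_{l-1}$ layer by layer, where $D_{l-1}$ is the 0/1 diagonal of ReLU derivatives.

```python
import numpy as np


def jacobian_norm(depth, width=256, sigma_w2=2.0, seed=0):
    """Spectral norm of the input-output Jacobian of a random ReLU MLP.

    Assumptions (illustrative, not from the paper): no biases, constant
    width, Gaussian stand-in input, weights W_l ~ N(0, sigma_w2 / width).
    """
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(width)   # pre-activation fed to the first layer
    J = np.eye(width)                # running product W_l D_{l-1} ... W_1 D_0
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(sigma_w2 / width)
        D = (y > 0).astype(float)    # diagonal of D_{l-1} = ReLU'(y_{l-1})
        J = (W * D) @ J              # W * D scales columns, i.e. W @ diag(D)
        y = W @ np.maximum(y, 0.0)   # next pre-activation
    return np.linalg.norm(J, 2)      # largest singular value


if __name__ == "__main__":
    for s2 in (1.0, 2.0, 4.0):
        print(f"sigma_w^2 = {s2}: ||J|| = {jacobian_norm(20, sigma_w2=s2):.3e}")
```

Because ReLU is positively homogeneous, changing $\sigma_w^2$ with a fixed seed leaves the activation patterns unchanged, so the norm is multiplied by exactly $(\sigma_w^2/2)^{L/2}$ relative to the $\sigma_w^2 = 2$ case: it shrinks geometrically in depth for $\sigma_w^2 < 2$ and grows geometrically for $\sigma_w^2 > 2$.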

Figures (9)

  • Figure 1: Illustration of the Jacobian norm at initialization in an MLP network of width $n=256$ and varying depth. The input is randomly selected from MNIST. All results are averaged over 3 runs. (Left) Impact of depth on the Jacobian norm for different $\sigma_w$. (Right) Evolution of the Jacobian norm as a function of depth for critical initialization.
  • Figure 2: Jacobian norm after pruning at initialization as depth increases in a randomly pruned MLP of width $n = 256$ (input sampled randomly from MNIST). (Left) Without scaling. (Right) With scaling.
  • Figure 3: Jacobian norm after pruning at initialization as depth increases in a score-based pruned MLP of width $n = 256$ (input sampled randomly from MNIST). (Upper Left) Without scaling (scaling factor $= 1$). (Upper Right) Scaling factor $= (1 - s_{n})^{-\frac{1}{2}}$. (Lower Left) Scaling factor calculated based on the magnitude-based pruning theorem. (Lower Right) Comparison of the scaling factors as a function of sparsity level.
  • Figure 4: Illustration of the Jacobian norm for a randomly selected input with an MLP architecture of width $n = 256$ and varying depths. All results are averaged over 3 runs, and confidence intervals are highlighted with shaded areas. (Left) Impact of Depth on the Jacobian norm for different correlation levels. (Right) Impact of the injection of the correlation between the weights on the Jacobian norm.
  • Figure 5: Joint distributions of three randomly selected entries of $D_l$ (denoted by $1$, $2$, and $3$) for $l=10$ in a depth $L=30$ and width $n=100$ MLP with a randomly selected input, based on $N=1000$ simulations. Since the values of the entries are binary ($0$ or $1$) we added random Gaussian noise (variance $0.01$) to the points for better visibility.
  • ...and 4 more figures
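
The random-pruning setting of Figure 2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' experimental code: each weight is kept independently with probability $1 - s$ (a Bernoulli mask), the input is a Gaussian stand-in rather than an MNIST sample, the network is a bias-free ReLU MLP, and the spectral norm is used. The function name `pruned_jacobian_norm` and its parameters are hypothetical.

```python
import numpy as np


def pruned_jacobian_norm(depth, sparsity, rescale, width=256, sigma_w2=2.0, seed=0):
    """Spectral norm of the Jacobian of a randomly pruned ReLU MLP.

    With rescale=True, surviving weights are multiplied by
    (1 - sparsity) ** -0.5; with rescale=False they are left unchanged.
    """
    rng = np.random.default_rng(seed)
    keep = 1.0 - sparsity
    scale = keep ** -0.5 if rescale else 1.0
    y = rng.standard_normal(width)               # stand-in input
    J = np.eye(width)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(sigma_w2 / width)
        mask = rng.random((width, width)) < keep  # keep each weight w.p. 1 - s
        W = scale * mask * W                      # prune, then optionally rescale
        D = (y > 0).astype(float)                 # ReLU derivative pattern
        J = (W * D) @ J                           # W @ diag(D) @ J
        y = W @ np.maximum(y, 0.0)
    return np.linalg.norm(J, 2)


if __name__ == "__main__":
    for rescale in (False, True):
        value = pruned_jacobian_norm(20, sparsity=0.5, rescale=rescale)
        print(f"rescale={rescale}: ||J|| = {value:.3e}")
```

With a fixed seed the two calls share the same masks and activation patterns, so the rescaled norm is larger than the unscaled one by the factor $(1-s)^{-L/2}$, which is $2^{10}$ for $s = 0.5$ and $L = 20$. This sketch does not probe the high-sparsity regime where the TL;DR reports an edge of stability.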

Theorems & Definitions (14)

  • definition 1: Stable Jacobian
  • theorem 1: Corollary of Eq. (17) in pennington2017
  • theorem 2: Stability theorem
  • lemma 3.1
  • proof
  • proof: Proof of theorem 2 (Stability theorem)
  • theorem 3: Scaling guarantees stability
  • proof
  • theorem 4: Magnitude-based pruning, deterministic threshold
  • proof
  • ...and 4 more