On the Stability of the Jacobian Matrix in Deep Neural Networks
Benjamin Dadoun, Soufiane Hayou, Hanan Salam, Mohamed El Amine Seddik, Pierre Youssef
TL;DR
Deep neural networks often suffer from vanishing or exploding gradients, which are governed by the spectral behavior of the input–output Jacobian $J_1=\prod_{l=1}^L W_l D_{l-1}$. The paper develops a general stability theorem for products of random matrices that applies to sparse and non-i.i.d., weakly correlated weights. Using random matrix theory and asymptotic freeness, it shows that $\|J_k^B\|$ converges to the norm of a product of semicircular elements, $\|s_k\cdots s_L\|$. It then specializes to sparse networks, showing that appropriate weight scaling after pruning (e.g., by $(1-s_n)^{-1/2}$) preserves stability and that an edge of stability emerges at high sparsity; similar scaling-based stability holds for score-based pruning, with method-dependent factors. The paper also proves that dependent weights with sufficiently small, width-dependent correlations retain stability in expectation. Empirical validation covers independence tests for diagonal activations, the effect of scaling on trainability, and behavior under correlated weights. Overall, the work broadens the theoretical foundation for initializing and pruning modern neural networks and offers practical guidance for maintaining Jacobian stability beyond i.i.d. weight initialization.
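To illustrate the role of the $(1-s_n)^{-1/2}$ factor mentioned above, here is a minimal sketch in plain Python. It is not the paper's experimental setup. It assumes a purely linear network (the diagonal matrices $D_l$ are dropped), Gaussian weights with variance $1/\text{width}$, and random pruning with an i.i.d. Bernoulli mask that removes each weight with probability `sparsity`. It also tracks the norm of one propagated unit vector rather than the operator norm of the Jacobian. The function name `propagate_norm` and all parameter values are hypothetical.

```python
import math
import random


def propagate_norm(width, depth, sparsity, rescale, seed=0):
    """Push a unit vector through `depth` randomly pruned Gaussian layers
    and return the Euclidean norm of the result (toy linear model)."""
    rng = random.Random(seed)
    # Entries ~ N(0, 1/width): an unpruned layer preserves the squared
    # norm of its input on average.
    std = 1.0 / math.sqrt(width)
    # Assumed correction: multiply kept weights by (1 - s)^(-1/2) so that
    # each entry has variance 1/width again after pruning.
    scale = (1.0 - sparsity) ** -0.5 if rescale else 1.0
    x = [1.0 / math.sqrt(width)] * width  # unit-norm input vector
    for _ in range(depth):
        y = []
        for _ in range(width):
            acc = 0.0
            for xj in x:
                w = rng.gauss(0.0, std)
                keep = rng.random() >= sparsity  # keep with prob. 1 - s
                if keep:
                    acc += scale * w * xj
            y.append(acc)
        x = y
    return math.sqrt(sum(v * v for v in x))


width, depth, sparsity = 64, 10, 0.5
plain = propagate_norm(width, depth, sparsity, rescale=False)
scaled = propagate_norm(width, depth, sparsity, rescale=True)
print(f"unscaled: {plain:.4f}, rescaled: {scaled:.4f}")
```

Both calls use the same seed, so they draw identical weights and masks. Because the toy model is linear, the two norms differ exactly by the factor $(1-s)^{-L/2}$, which is $32$ for $s = 0.5$ and $L = 10$. Without the correction the signal shrinks geometrically with depth, while with it the norm stays of order one.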
Abstract
Deep neural networks are known to suffer from exploding or vanishing gradients as depth increases, a phenomenon closely tied to the spectral behavior of the input-output Jacobian. Prior work has identified critical initialization schemes that ensure Jacobian stability, but these analyses are typically restricted to fully connected networks with i.i.d. weights. In this work, we go significantly beyond these limitations: we establish a general stability theorem for deep neural networks that accommodates sparsity (such as that introduced by pruning) and non-i.i.d., weakly correlated weights (e.g. induced by training). Our results rely on recent advances in random matrix theory, and provide rigorous guarantees for spectral stability in a much broader class of network models. This extends the theoretical foundation for initialization schemes in modern neural networks with structured and dependent randomness.
