Nonlinear spiked covariance matrices and signal propagation in deep neural networks

Zhichao Wang; Denny Wu; Zhou Fan

TL;DR

This work studies how low-rank, spiked structure in high-dimensional input data gives rise to outlier eigenvalues and aligned eigenvectors in nonlinear kernel matrices, such as the conjugate kernel (CK) of a deep neural network. It develops a nonlinear spiked covariance framework, proves BBP-type phase transitions and first-order limits for the spike eigenvalues, and provides deterministic equivalents that relate the nonlinear spike behavior to a linear model via Gaussian equivalence. The results quantify how input spikes propagate through depth, how gradient-descent training can create and amplify a rank-one spike in a weight matrix, and how that spike appears in the CK and in the alignment of its spike eigenvector with the target function on test data. The findings offer a principled spectral lens for understanding representation learning and signal propagation in deep networks, with implications for architecture design and training dynamics.
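
As a point of reference for the BBP-type phase transition mentioned above, the following is a minimal sketch of the classical linear case only, not the paper's nonlinear model. It assumes a population covariance $I + \theta e_1 e_1^\top$ and aspect ratio $\gamma = d/n$; the function names, sizes, and the seed are illustrative choices that do not come from the paper. In this classical setting the largest sample eigenvalue separates from the bulk edge $(1+\sqrt{\gamma})^2$ only when $\theta > \sqrt{\gamma}$, in which case it concentrates near $(1+\theta)(1+\gamma/\theta)$.

```python
import numpy as np


def bbp_prediction(theta, gamma):
    """Classical limit of the top sample-covariance eigenvalue.

    Assumes population covariance I + theta * e1 e1^T and gamma = d / n.
    """
    if theta > np.sqrt(gamma):
        # Supercritical spike: an outlier separates from the bulk.
        return (1.0 + theta) * (1.0 + gamma / theta)
    # Subcritical spike: the top eigenvalue sticks to the bulk edge.
    return (1.0 + np.sqrt(gamma)) ** 2


def top_sample_eigenvalue(theta, n=2000, d=1000, seed=0):
    """Largest eigenvalue of the sample covariance of n Gaussian samples."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    # Scale the first coordinate so that Cov(x) = I + theta * e1 e1^T.
    X[:, 0] *= np.sqrt(1.0 + theta)
    S = X.T @ X / n
    # eigvalsh returns eigenvalues in ascending order.
    return np.linalg.eigvalsh(S)[-1]


if __name__ == "__main__":
    gamma = 1000 / 2000
    for theta in (0.3, 2.0):
        print(theta, top_sample_eigenvalue(theta), bbp_prediction(theta, gamma))
```

Running the script compares one spike strength below the threshold and one above it; at these sizes the empirical and predicted values should agree up to finite-sample fluctuations.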

Abstract

Many recent works have studied the eigenvalue spectrum of the Conjugate Kernel (CK) defined by the nonlinear feature map of a feedforward neural network. However, existing results only establish weak convergence of the empirical eigenvalue distribution, and fall short of providing precise quantitative characterizations of the "spike" eigenvalues and eigenvectors that often capture the low-dimensional signal structure of the learning problem. In this work, we characterize these signal eigenvalues and eigenvectors for a nonlinear version of the spiked covariance model, including the CK as a special case. Using this general result, we give a quantitative description of how spiked eigenstructure in the input data propagates through the hidden layers of a neural network with random weights. As a second application, we study a simple regime of representation learning where the weight matrix develops a rank-one signal component over training and characterize the alignment of the target function with the spike eigenvector of the CK on test data.

Paper Structure (33 sections, 35 theorems, 314 equations, 3 figures)

Introduction
Our Contributions
Related Works
Eigenvalues of nonlinear random matrices.
Precise error analysis of NNs.
Eigenvalues of sample covariance matrices.
Results for neural network models
Propagation of signal through multi-layer neural networks
Numerical illustration.
CK matrix after $O(1)$ steps of gradient descent
Numerical illustration.
Analysis of a nonlinear spiked covariance model
Proof ideas.
Notations and background
Stochastic domination
...and 18 more sections

Key Result

Theorem 2

Suppose Assumptions assump:NNasymptotics, assump:data, and assump:sigma hold. Then for each $\ell=1,\ldots,L$, eq:CKweakconvergence holds weakly a.s. as $n \to \infty$. Furthermore, if the number of spikes is $r=0$ in Assumption assump:data, then for any fixed $\varepsilon>0$, almost surely for all

Figures (3)

Figure 1: Spectra of three-layer CK matrices defined by \ref{eq:K_L} with $n = 5000$, $d_0=d_1=d_2 = 15000$, and $\sigma\propto \arctan$. Input data is a GMM satisfying \ref{eq:gmm} with $r=3$, $\theta_1=2.0$, $\theta_2=1.18$, and $\theta_3=1.0$. (a)-(c) are theoretically predicted (red) and empirical (blue) bulk distributions and spikes of $\boldsymbol{K}_\ell$ for $\ell=0,1,2$.
Figure 2: We consider multi-layer NNs in \ref{eq:NN} with $\sigma\propto \tanh$ on Gaussian mixture data \ref{eq:gmm} for $r=1$, and compute the alignment between the leading eigenvector of the CK matrix $\boldsymbol{K}_\ell$ and the genuine signal $\boldsymbol{b}_1$ (class labels) for different layers $\ell$. (a) NNs at random initialization with varying hidden widths $N=2048,4096,8192,10240$. (b) NNs trained by gradient descent with learning rate $\eta=0.1$ for varying steps $T=0,10,20,50$; we use the $\mu$-parameterization \cite{yang2020feature} to encourage feature learning. $\theta_1$ is $2.5$ and $1.8$ for (a) and (b), respectively. Dots are empirical values (over 10 runs) and solid curves represent theoretical predictions at random initialization from Theorem \ref{thm:ck_spike}.
Figure 3: (a) We set $n=2000$, $d=1600$, $N=2400$, $\eta\cdot t = 2$, and $\sigma=\sigma_*=\text{erf}$. (b) We set $d=2048$, $N=1024$, $\eta=0.2$, $\sigma=\tanh$, $\sigma_*=\text{SoftPlus}$, and vary the sample size $n$ and number of GD steps $t$; dots represent empirical simulations (over 10 runs) and solid curves are theoretical predictions from Theorem \ref{thm:gd_spike}.
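
The kind of experiment described in Figure 2(a) can be sketched at small scale as follows. This is only an illustration under stated assumptions: the definitions \ref{eq:gmm}, \ref{eq:NN}, and \ref{eq:K_L} are not reproduced here, so the sketch assumes a two-class mixture $x_i = y_i \mu + z_i$ with $z_i \sim N(0, I_d)$, i.i.d. standard Gaussian weights with a $1/\sqrt{\text{fan-in}}$ scaling, and the CK taken as the normalized Gram matrix of the layer features. The function name, the `signal_strength` parameter, and all sizes are hypothetical and are not the paper's parameterization of $\theta_1$.

```python
import numpy as np


def ck_alignment_by_layer(n=400, d=300, width=600, depth=2,
                          signal_strength=16.0, seed=0):
    """Alignment of the top Gram/CK eigenvector with the labels, per layer.

    Assumed (hypothetical) setup: x_i = y_i * mu + z_i, z_i ~ N(0, I_d),
    ||mu||^2 = signal_strength, random Gaussian weights, tanh activation.
    Returns [alignment at input, alignment after layer 1, ...].
    """
    rng = np.random.default_rng(seed)
    y = rng.choice([-1.0, 1.0], size=n)           # class labels
    mu = rng.standard_normal(d)
    mu *= np.sqrt(signal_strength) / np.linalg.norm(mu)
    feats = np.outer(mu, y) + rng.standard_normal((d, n))   # d x n inputs

    alignments = []
    for layer in range(depth + 1):
        fan = feats.shape[0]
        gram = feats.T @ feats / fan              # n x n Gram (CK) matrix
        # eigh sorts eigenvalues ascending; the last column is the top one.
        top_vec = np.linalg.eigh(gram)[1][:, -1]
        # Absolute cosine between the unit eigenvector and y / sqrt(n).
        alignments.append(abs(top_vec @ y) / np.sqrt(n))
        if layer < depth:
            W = rng.standard_normal((width, fan))
            feats = np.tanh(W @ feats / np.sqrt(fan))
    return alignments


if __name__ == "__main__":
    for layer, a in enumerate(ck_alignment_by_layer()):
        print(f"layer {layer}: alignment {a:.3f}")
```

The sketch reports only empirical alignments at random initialization; it does not implement the theoretical curves from Theorem \ref{thm:ck_spike}, and the signal strength is deliberately chosen large so that the spike is clearly visible at these small sizes.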

Theorems & Definitions (36)

Definition 1
Theorem 2
Theorem 3
Corollary 4
Theorem 5
Theorem 6: informal
Proposition 7
Theorem 8
Theorem 9
Corollary 10
...and 26 more
