Power-Law Spectrum of the Random Feature Model

Elliot Paquette, Ke Liang Xiao, Yizhe Zhu

Abstract

Scaling laws for neural networks, in which the loss decays as a power-law in the number of parameters, data, and compute, depend fundamentally on the spectral structure of the data covariance, with power-law eigenvalue decay appearing ubiquitously in vision and language tasks. A central question is whether this spectral structure is preserved or destroyed when data passes through the basic building block of a neural network: a random linear projection followed by a nonlinear activation. We study this question for the random feature model: given data $x \sim N(0,H)$ in $\mathbb{R}^v$ where $H$ has $\alpha$-power-law spectrum ($\lambda_j(H) \asymp j^{-\alpha}$, $\alpha > 1$), a Gaussian sketch matrix $W \in \mathbb{R}^{v\times d}$, and an entrywise monomial $f(y) = y^{p}$, we characterize the eigenvalues of the population random-feature covariance $\mathbb{E}_{x}[\frac{1}{d}f(W^\top x)^{\otimes 2}]$. We prove matching upper and lower bounds: for all $1 \leq j \leq c_1 d \log^{-(p+1)}(d)$, the $j$-th eigenvalue is of order $\left(\log^{p-1}(j+1)/j\right)^{\alpha}$. For $c_1 d \log^{-(p+1)}(d) \leq j \leq d$, the $j$-th eigenvalue is of order $j^{-\alpha}$ up to a polylog factor. That is, the power-law exponent $\alpha$ is inherited exactly from the input covariance, modified only by a logarithmic correction that depends on the monomial degree $p$. The proof combines a dyadic head-tail decomposition with Wick chaos expansions for higher-order monomials and random matrix concentration inequalities.
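The setup in the abstract can be explored numerically at small scale. The following is a minimal sketch, not the authors' code: it assumes a diagonal $H$ with eigenvalues exactly $j^{-\alpha}$, and the function names, the sizes ($v = 1000$, $d = 200$), the choice $\alpha = 1.5$, and the fitting window $j \leq d/10$ are all illustrative assumptions. For $p = 1$ the population covariance is exactly $\frac{1}{d} W^\top H W$; for other degrees the sketch falls back on a Monte Carlo estimate.

```python
import numpy as np

def power_law_spectrum(v, alpha):
    """Eigenvalues j^{-alpha}, j = 1..v, of a diagonal covariance H (assumed form)."""
    return np.arange(1, v + 1, dtype=float) ** (-alpha)

def linear_feature_cov(h, W):
    """Exact population covariance (1/d) W^T H W for p = 1, with H = diag(h)."""
    d = W.shape[1]
    return (W.T * h) @ W / d

def mc_feature_cov(h, W, p, m, rng):
    """Monte Carlo estimate of E_x[(1/d) f(W^T x) f(W^T x)^T] with f(y) = y^p."""
    v, d = W.shape
    X = rng.standard_normal((m, v)) * np.sqrt(h)  # each row is x ~ N(0, diag(h))
    F = (X @ W) ** p                              # entrywise monomial features
    return F.T @ F / (m * d)

def sorted_eigs(M):
    """Eigenvalues of a symmetric matrix in decreasing order."""
    return np.linalg.eigvalsh(M)[::-1]

def loglog_slope(eigs, j_max):
    """Least-squares slope of log(eigenvalue) against log(j) for j = 1..j_max."""
    j = np.arange(1, j_max + 1)
    return np.polyfit(np.log(j), np.log(eigs[:j_max]), 1)[0]

rng = np.random.default_rng(0)
alpha, v, d = 1.5, 1000, 200          # illustrative sizes, far smaller than the paper's
h = power_law_spectrum(v, alpha)
W = rng.standard_normal((v, d))       # Gaussian sketch with i.i.d. N(0, 1) entries

eigs_linear = sorted_eigs(linear_feature_cov(h, W))
print("p = 1 head slope:", loglog_slope(eigs_linear, d // 10), "vs -alpha =", -alpha)

eigs_quad = sorted_eigs(mc_feature_cov(h, W, p=2, m=4000, rng=rng))
print("p = 2 top eigenvalues (normalized):", eigs_quad[:5] / eigs_quad[0])
```

At these sizes the fitted slope should only be read as a rough check: the fit covers a short head window, and the Monte Carlo estimate for $p \geq 2$ is noisy in the tail of the spectrum.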

Paper Structure

This paper contains 32 sections, 23 theorems, 241 equations, 12 figures, 1 table.

Key Result

Theorem 1

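Let $\alpha>1$ and $\mathbf{x} \sim N(0,\mathbf{H})$ such that $\mathbf{H} \in \mathbb{R}^{v \times v}$ has $\alpha$-power-law spectrum. If $v \geq d$ and the sketch matrix $\mathbf{W} \in \mathbb{R}^{v \times d}$ has i.i.d. standard Gaussian entries, then given a monomial $f(y) = y^{\text{p}}$ and $1 \leq j \leq c_1 d \log^{-(\text{p}+1)}(d)$, the $j$-th eigenvalue of $\mathbb{E}_{\mathbf{x}}[\frac{1}{d}f(\mathbf{W}^\top \mathbf{x})^{\otimes 2}]$ is bounded above and below by constant multiples of $\left(\log^{\text{p}-1}(j+1)/j\right)^{\alpha}$. Moreover, if $c_1 d \log^{-(\text{p}+1)}(d) <j \leq d$, then the $j$-th eigenvalue is of order $j^{-\alpha}$ up to a polylogarithmic factor.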
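As a worked instance of the object in Theorem 1, the case $\text{p} = 2$ has a closed form. The notation $\mathbf{K} = \mathbf{W}^\top \mathbf{H} \mathbf{W}$ and $\mathbf{k} = (\mathbf{K}_{aa})_{a=1}^{d}$ is introduced here for illustration and is not taken from the paper. Since $\mathbf{y} = \mathbf{W}^\top \mathbf{x} \sim N(0,\mathbf{K})$ conditionally on $\mathbf{W}$, Isserlis' (Wick's) formula for centered Gaussians gives

$$\mathbb{E}_{\mathbf{x}}\left[y_a^2\, y_b^2\right] = \mathbf{K}_{aa}\mathbf{K}_{bb} + 2\,\mathbf{K}_{ab}^2, \qquad\text{so}\qquad \mathbb{E}_{\mathbf{x}}\left[\tfrac{1}{d} f(\mathbf{W}^\top \mathbf{x})^{\otimes 2}\right] = \tfrac{1}{d}\left(\mathbf{k}\mathbf{k}^\top + 2\,\mathbf{K}\circ\mathbf{K}\right),$$

where $\circ$ denotes the entrywise (Hadamard) product. The covariance is thus a rank-one term plus an entrywise square of the linear-sketch matrix $\mathbf{K}$, which is the simplest example of the kind of decomposition that a Wick chaos expansion produces for higher degrees.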

Figures (12)

  • Figure 1: Power-law spectral preservation. Eigenvalue spectra (normalized by $\lambda_1$) on log-log axes. (a) Iterated linear sketches: starting from a population covariance $D = \mathrm{diag}(j^{-1.31})$ with $v = 10{,}000$ (red line), we apply three successive Gaussian sketches ($10{,}000 \to 3{,}000 \to 1{,}000 \to 1{,}000$). All three sketched spectra collapse onto the population power-law; only the tails peel off where the sketch dimension limits the rank. (b) A $4$-layer $\tanh$ MLP (width $1024$, random initialization) applied to CIFAR-10 data. All five spectra---input and four hidden layers---collapse onto the same $j^{-1.31}$ power-law, confirming that the spectral exponent is preserved across multiple nonlinear layers on real data.
  • Figure 2: Median postactivation eigenvalue (normalized so the top eigenvalue equals $1$) across 100 independent trials for Gaussian data with $\alpha = 1.31$, using $v = d = 10{,}000$ and $m = 100{,}000$ Monte Carlo samples. The dashed line shows the reference $j^{-\alpha}$. Left: monomial activations $f(y) = y^{\text{p}}$ for $\text{p} = 1,\ldots,6$; all exhibit power-law decay. Right: non-monomial activations (ReLU, $\tanh$, Heaviside, $z^2 e^{-z^2}$) also display power-law spectra, with all but $z^2 e^{-z^2}$ tracking the reference slope closely.
  • Figure 3: Eigenvalue spectra of the random-feature covariance for four data distributions (Gaussian, Rademacher, Student-$t(4)$, and CIFAR-10) with $\alpha = 1.31$, $d = 3{,}000$, and $m = 30{,}000$ Monte Carlo samples. Each panel shows a different activation function. The spectral slopes are consistent across distributions, providing strong evidence of universality. For $x^2$ and $x^3$, the grey curve shows a zero-free-parameter theory prediction obtained by combining the lattice-point counts of all composition types in the Wick decomposition; this curve is asymptotically equivalent to the $(\log^{\text{p}-1}(j+1)/j)^\alpha$ rate of Theorem 1 but captures the subleading corrections. See the paper's remark on the theory curve for details.
  • Figure 4: If $\pi = (4,2)$, then there are $4$ copies of $i_1$ and $2$ copies of $i_2$. The Feynman diagrams $\gamma_1, \ldots, \gamma_6$ are all mapped to $\eta = (1,1)$; that is, they all yield the same Wick product $\mathbf{H}_{i_1}\mathbf{H}_{i_2}\left \langle \mathbf{y}^{(a),2}_{i_1} \right \rangle$. Thus, $N_{(1,1)}=6$.
  • Figure 5: For the set $g_{a_1},g_{a_2},g_{a_3},g_{a_4}$, there exist $10$ distinct Feynman diagrams.
  • ...and 7 more figures
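The diagram counts quoted in the captions of Figures 4 and 5 can be reproduced by brute-force enumeration. The sketch below treats a Feynman diagram as a partial matching: a set of disjoint pairs of vertices, with any number of vertices left unpaired. For Figure 4 it additionally assumes that $\eta = (1,1)$ means exactly one pair among the copies of $i_1$ and exactly one pair among the copies of $i_2$, with no pair joining different indices; this reading is inferred from the caption, and the function and variable names are hypothetical.

```python
def partial_matchings(vertices):
    """Yield every partial matching of `vertices` as a list of disjoint pairs."""
    vertices = list(vertices)
    if not vertices:
        yield []
        return
    first, rest = vertices[0], vertices[1:]
    # Case 1: `first` is left unpaired.
    for matching in partial_matchings(rest):
        yield matching
    # Case 2: `first` is paired with one of the remaining vertices.
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for matching in partial_matchings(remaining):
            yield [(first, partner)] + matching

# Figure 5: all diagrams on four labeled Gaussians.
diagrams = list(partial_matchings(["g_a1", "g_a2", "g_a3", "g_a4"]))
print(len(diagrams))  # 10 = 1 (no pairs) + 6 (one pair) + 3 (two pairs)

# Figure 4: pi = (4, 2), i.e. four copies of i1 and two copies of i2.
copies = [("i1", k) for k in range(4)] + [("i2", k) for k in range(2)]

def pairs_within(diagram, index):
    """Number of pairs in `diagram` whose two endpoints are both copies of `index`."""
    return sum(1 for a, b in diagram if a[0] == index and b[0] == index)

def is_type_1_1(diagram):
    """Assumed reading of eta = (1, 1): one pair inside i1, one inside i2, none mixed."""
    no_mixed_pairs = all(a[0] == b[0] for a, b in diagram)
    return (no_mixed_pairs
            and pairs_within(diagram, "i1") == 1
            and pairs_within(diagram, "i2") == 1)

n_11 = sum(1 for g in partial_matchings(copies) if is_type_1_1(g))
print(n_11)  # 6 = C(4, 2) ways to pair two copies of i1, times 1 way for i2
```

The recursion visits each diagram exactly once because every diagram is determined by what happens to the first vertex (unpaired, or paired with a specific partner) together with a diagram on the remaining vertices.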

Theorems & Definitions (53)

  • Definition 1
  • Theorem 1
  • Theorem 2
  • proof
  • Proposition 1: Principal term
  • proof
  • Definition 2
  • Definition 3
  • Example 1: Restricted ordered tuple $\mathcal{I}_{\pi,\eta}$
  • Lemma 1
  • ...and 43 more