Power-Law Spectrum of the Random Feature Model

Elliot Paquette; Ke Liang Xiao; Yizhe Zhu

Abstract

Scaling laws for neural networks, in which the loss decays as a power law in the number of parameters, data, and compute, depend fundamentally on the spectral structure of the data covariance, with power-law eigenvalue decay appearing ubiquitously in vision and language tasks. A central question is whether this spectral structure is preserved or destroyed when data passes through the basic building block of a neural network: a random linear projection followed by a nonlinear activation. We study this question for the random feature model: given data $x \sim N(0,H)\in \mathbb{R}^v$ where $H$ has $\alpha$-power-law spectrum ($\lambda_j(H) \asymp j^{-\alpha}$, $\alpha > 1$), a Gaussian sketch matrix $W \in \mathbb{R}^{v\times d}$, and an entrywise monomial $f(y) = y^{p}$, we characterize the eigenvalues of the population random-feature covariance $\mathbb{E}_{x}[\frac{1}{d}f(W^\top x)^{\otimes 2}]$. We prove matching upper and lower bounds: for all $1 \leq j \leq c_1 d \log^{-(p+1)}(d)$, the $j$-th eigenvalue is of order $\left(\log^{p-1}(j+1)/j\right)^{\alpha}$. For $c_1 d \log^{-(p+1)}(d)\leq j\leq d$, the $j$-th eigenvalue is of order $j^{-\alpha}$ up to a polylog factor. That is, the power-law exponent $\alpha$ is inherited exactly from the input covariance, modified only by a logarithmic correction that depends on the monomial degree $p$. The proof combines a dyadic head-tail decomposition with Wick chaos expansions for higher-order monomials and random matrix concentration inequalities.

Paper Structure (32 sections, 23 theorems, 241 equations, 12 figures, 1 table)

Introduction
Related work
Comparison with wortsman2025kernel
Notation
Main results
Discussion and numerical experiments
Illustrating the main theorem.
Universality across data distributions.
Beyond monomial activations.
Persistence across layers.
Open questions.
Warm up: Linear Case
Higher order case
Principal term
Subleading terms and Wick chaoses
...and 17 more sections

Key Result

Theorem 1

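Let $\alpha>1$ and $\mathbf{x} \sim N(0,\mathbf{H})$ such that $\mathbf{H} \in \mathbb{R}^{v \times v}$ has $\alpha$-power-law spectrum. If $v \geq d$ and the sketch matrix $\mathbf{W} \in \mathbb{R}^{v \times d}$ has i.i.d. standard Gaussian entries, then given a monomial $f(y) = y^{\text{p}}$, the eigenvalues of the population random-feature covariance $\mathbb{E}_{\mathbf{x}}[\frac{1}{d}f(\mathbf{W}^\top \mathbf{x})^{\otimes 2}]$ obey matching upper and lower bounds: for all $1 \leq j \leq c_1 d \log^{-(\text{p}+1)}(d)$, the $j$-th eigenvalue is of order $\left(\log^{\text{p}-1}(j+1)/j\right)^{\alpha}$. Moreover, if $c_1 d \log^{-(\text{p}+1)}(d) <j \leq d$ then the $j$-th eigenvalue is of order $j^{-\alpha}$ up to a polylog factor.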
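
As a rough numerical illustration of the linear case $\text{p} = 1$, where the logarithmic correction disappears and the predicted rate is $j^{-\alpha}$, the sketch below builds the population covariance in closed form, $\frac{1}{d}\mathbf{W}^\top \mathbf{H} \mathbf{W}$, and fits a log-log slope over the leading eigenvalues. The function names, the diagonal choice $\mathbf{H} = \mathrm{diag}(j^{-\alpha})$, the dimensions, and the fitting window are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def linear_feature_spectrum(v=1000, d=200, alpha=1.5, seed=0):
    """Eigenvalues (descending) of (1/d) W^T H W for a Gaussian sketch W.

    Assumption: H = diag(j^{-alpha}), j = 1..v. For f(y) = y, the population
    covariance E_x[(1/d) (W^T x)(W^T x)^T] equals (1/d) W^T H W exactly,
    so no Monte Carlo sampling over x is needed.
    """
    rng = np.random.default_rng(seed)
    h = np.arange(1, v + 1, dtype=float) ** (-alpha)  # spectrum of H
    W = rng.standard_normal((v, d))                   # i.i.d. N(0, 1) entries
    K = (W.T * h) @ W / d                             # (1/d) W^T diag(h) W
    return np.linalg.eigvalsh(K)[::-1]                # eigvalsh is ascending


def loglog_slope(eig, j_min, j_max):
    """Least-squares slope of log(eigenvalue) against log(index j)."""
    j = np.arange(j_min, j_max + 1)
    slope, _ = np.polyfit(np.log(j), np.log(eig[j_min - 1:j_max]), 1)
    return slope


if __name__ == "__main__":
    eig = linear_feature_spectrum()
    print("fitted slope over j = 1..20:", loglog_slope(eig, 1, 20))
```

With these settings the fitted slope should land near $-\alpha$, though at this small size it fluctuates from seed to seed. The window stops well short of $d$ because the theorem only guarantees the clean rate up to $c_1 d \log^{-(\text{p}+1)}(d)$.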

Figures (12)

Figure 1: Power-law spectral preservation. Eigenvalue spectra (normalized by $\lambda_1$) on log-log axes. (a) Iterated linear sketches: starting from a population covariance $D = \mathrm{diag}(j^{-1.31})$ with $v = 10{,}000$ (red line), we apply three successive Gaussian sketches ($10{,}000 \to 3{,}000 \to 1{,}000 \to 1{,}000$). All three sketched spectra collapse onto the population power-law; only the tails peel off where the sketch dimension limits the rank. (b) A $4$-layer $\tanh$ MLP (width $1024$, random initialization) applied to CIFAR-10 data. All five spectra---input and four hidden layers---collapse onto the same $j^{-1.31}$ power-law, confirming that the spectral exponent is preserved across multiple nonlinear layers on real data.
Figure 2: Median postactivation eigenvalue (normalized so the top eigenvalue equals $1$) across 100 independent trials for Gaussian data with $\alpha = 1.31$, using $v = d = 10{,}000$ and $m = 100{,}000$ Monte Carlo samples. The dashed line shows the reference $j^{-\alpha}$. Left: monomial activations $f(y) = y^{\text{p}}$ for $\text{p} = 1,\ldots,6$; all exhibit power-law decay. Right: non-monomial activations (ReLU, $\tanh$, Heaviside, $z^2 e^{-z^2}$) also display power-law spectra, with all but $z^2 e^{-z^2}$ tracking the reference slope closely.
Figure 3: Eigenvalue spectra of the random-feature covariance for four data distributions (Gaussian, Rademacher, Student-$t(4)$, and CIFAR-10) with $\alpha = 1.31$, $d = 3{,}000$, and $m = 30{,}000$ Monte Carlo samples. Each panel shows a different activation function. The spectral slopes are consistent across distributions, providing strong evidence of universality. For $x^2$ and $x^3$, the grey curve shows a zero-free-parameter theory prediction obtained by combining the lattice-point counts of all composition types in the Wick decomposition; this curve is asymptotically equivalent to the $(\log^{\text{p}-1}(j+1)/j)^\alpha$ rate of Theorem 1 but captures the subleading corrections. See Remark \ref{rem:theory_curve} for details.
Figure 4: If $\pi = (4,2)$, then there are $4$ copies of $i_1$ and $2$ copies of $i_2$. The Feynman diagrams $\gamma_1, \cdots, \gamma_6$ are all mapped to $\eta = (1,1)$; that is, they all yield the same Wick product $\mathbf{H}_{i_1}\mathbf{H}_{i_2}\left \langle \mathbf{y}^{(a),2}_{i_1} \right \rangle$. Thus, $N_{(1,1)}=6$.
Figure 5: For the set $g_{a_1},g_{a_2},g_{a_3},g_{a_4}$, there exist $10$ distinct Feynman diagrams.
...and 7 more figures
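
The count of $10$ diagrams in the Figure 5 caption can be reproduced by brute-force enumeration. The sketch below assumes the common convention that a Feynman diagram on a set of Gaussian variables is a collection of disjoint pairs with unpaired vertices allowed (a partial matching); the paper's exact definition is not shown in this summary, and the function name and vertex labels are hypothetical.

```python
def feynman_diagrams(vertices):
    """Yield every partial matching of `vertices` as a list of pairs.

    Assumption: a Feynman diagram is a set of disjoint unordered pairs,
    and any vertex may be left unpaired.
    """
    if not vertices:
        yield []
        return
    first, rest = vertices[0], vertices[1:]
    # Case 1: `first` stays unpaired.
    for diagram in feynman_diagrams(rest):
        yield diagram
    # Case 2: `first` is paired with one of the remaining vertices.
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for diagram in feynman_diagrams(remaining):
            yield [(first, partner)] + diagram


if __name__ == "__main__":
    labels = ["g_a1", "g_a2", "g_a3", "g_a4"]
    diagrams = list(feynman_diagrams(labels))
    complete = [dg for dg in diagrams if len(dg) == 2]
    print(len(diagrams), "diagrams,", len(complete), "of them complete")
```

Under this convention four variables give one diagram with no edges, six with a single edge, and three complete pairings, for ten in total.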

Theorems & Definitions (53)

Definition 1
Theorem 1
Theorem 2
proof
Proposition 1: Principal term
proof
Definition 2
Definition 3
Example 1: Restricted ordered tuple $\mathcal{I}_{\pi,\eta}$
Lemma 1
...and 43 more
