The Spectral Dimension of NTKs is Constant: A Theory of Implicit Regularization, Finite-Width Stability, and Scalable Estimation

Praveen Anilkumar Shukla

TL;DR

This work introduces the effective rank $r_{\text{eff}}(K_n)$ of the NTK Gram matrix as a concise scalar proxy for spectral concentration, and proves a constant-limit law for its expectation with sub-Gaussian concentration, independent of sample size. It establishes finite-width stability: small perturbations from finite width induce only $O_p(m^{-1/2})$ changes in $r_{\text{eff}}$. It also provides a scalable, unbiased estimator using random output probes and a CountSketch, with concrete variance bounds. A Mercer-spectrum analysis connects the limit to kernel eigenvalue decay, characterizing when the limit is finite via a power-law exponent $\alpha$. Experiments on CIFAR-10 with ResNet-20/56 show $r_{\text{eff}} \approx 1.0\text{--}1.3$ across $n$, consistent with theory. Overall, the paper offers a principled, scalable measure of implicit simplicity in wide NTK regimes and validates it with large-scale empirical evidence.

Abstract

Modern deep networks are heavily overparameterized yet often generalize well, suggesting a form of low intrinsic complexity not reflected by parameter counts. We study this complexity at initialization through the effective rank of the Neural Tangent Kernel (NTK) Gram matrix, $r_{\text{eff}}(K) = (\text{tr}(K))^2/\|K\|_F^2$. For i.i.d. data and the infinite-width NTK $k$, we prove a constant-limit law $\lim_{n\to\infty} \mathbb{E}[r_{\text{eff}}(K_n)] = \mathbb{E}[k(x, x)]^2 / \mathbb{E}[k(x, x')^2] =: r_\infty$, with sub-Gaussian concentration. We further establish finite-width stability: if the finite-width NTK deviates in operator norm by $O_p(m^{-1/2})$ (width $m$), then $r_{\text{eff}}$ changes by $O_p(m^{-1/2})$. We design a scalable estimator using random output probes and a CountSketch of parameter Jacobians and prove conditional unbiasedness and consistency with explicit variance bounds. On CIFAR-10 with ResNet-20/56 (widths 16/32) across $n \in \{10^3, 5\times10^3, 10^4, 2.5\times10^4, 5\times10^4\}$, we observe $r_{\text{eff}} \approx 1.0\text{--}1.3$ and slopes $\approx 0$ in $n$, consistent with the theory, and the kernel-moment prediction closely matches fitted constants.
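As a minimal sketch of the quantities defined in the abstract, the snippet below computes $r_{\text{eff}}(K) = (\text{tr}(K))^2/\|K\|_F^2$ for Gram matrices of growing size and compares it with a plug-in estimate of the kernel-moment ratio $\mathbb{E}[k(x,x)]^2/\mathbb{E}[k(x,x')^2]$. It uses a Gaussian (RBF) kernel on synthetic Gaussian data as a stand-in, since building an actual NTK is beyond a short example. The function names, bandwidth, and data dimension are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def effective_rank(K):
    """r_eff(K) = tr(K)^2 / ||K||_F^2 for a symmetric Gram matrix K."""
    return np.trace(K) ** 2 / np.sum(K * K)

def rbf_gram(X, bandwidth):
    """Gram matrix of a Gaussian kernel (a stand-in here, not an NTK)."""
    sq_norms = np.sum(X * X, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))

def moment_prediction(K):
    """Plug-in estimate of a^2 / b with a = E[k(x,x)], b = E[k(x,x')^2]."""
    n = K.shape[0]
    a_hat = np.mean(np.diag(K))        # diagonal entries estimate a
    off_diag = ~np.eye(n, dtype=bool)  # off-diagonal entries estimate b
    b_hat = np.mean(K[off_diag] ** 2)
    return a_hat ** 2 / b_hat

rng = np.random.default_rng(0)
results = {}
for n in (250, 500, 1000, 2000):
    X = rng.standard_normal((n, 5))    # synthetic i.i.d. data (assumption)
    K = rbf_gram(X, bandwidth=3.0)
    results[n] = (effective_rank(K), moment_prediction(K))
    print(n, results[n])
```

Under these assumptions, both printed values should stay roughly flat as $n$ grows, which is the qualitative behavior the constant-limit law describes. The specific constant depends on the kernel and data distribution chosen here and says nothing about the NTK values reported in the paper.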

Paper Structure

This paper contains 20 sections, 8 theorems, 31 equations, 1 figure, 2 tables, 1 algorithm.

Key Result

Theorem 3.1

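Under the tail assumption (asmp:tails), with $a=\mathbb{E}[k(x,x)]$ and $b=\mathbb{E}[k(x,x')^2]$, $\lim_{n\to\infty} \mathbb{E}[r_{\text{eff}}(K_n)] = a^2/b = r_\infty$. Moreover, there exist $c, C > 0$ (depending on sub-exponential norms) such that, for all $\varepsilon > 0$ and large $n$, $r_{\text{eff}}(K_n)$ satisfies a sub-Gaussian concentration bound with these constants.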
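A heuristic calculation, offered as intuition and not as the paper's proof, shows why the ratio $a^2/b$ from the abstract's constant-limit law is the natural limit. It assumes the diagonal and off-diagonal sums of $K_n$ behave like their expectations, and it introduces $c_2 = \mathbb{E}[k(x,x)^2]$ purely for illustration (this symbol does not appear in the source):

$$
\text{tr}(K_n) = \sum_{i=1}^{n} k(x_i,x_i) \approx n a, \qquad
\|K_n\|_F^2 = \sum_{i=1}^{n} k(x_i,x_i)^2 + \sum_{i \neq j} k(x_i,x_j)^2 \approx n c_2 + n(n-1)\, b,
$$

$$
r_{\text{eff}}(K_n) = \frac{(\text{tr}(K_n))^2}{\|K_n\|_F^2} \approx \frac{n^2 a^2}{n c_2 + n(n-1)\, b} = \frac{a^2}{b + (c_2 - b)/n} \;\xrightarrow[n\to\infty]{}\; \frac{a^2}{b}.
$$

The $n^2$ off-diagonal terms dominate the Frobenius norm, so the diagonal contribution $c_2$ only enters as an $O(1/n)$ correction.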

Figures (1)

  • Figure 1: $r_{\mathrm{eff}}$ vs. $1/n$ (depth=20, widths=16 and 32). Shading denotes variability across seeds.

Theorems & Definitions (15)

  • Theorem 3.1: Constant-limit effective rank
  • proof
  • Remark 3.1: Mercer representation
  • Lemma 4.1: Probe identity
  • proof
  • Lemma 4.2: CountSketch inner-product preservation
  • proof
  • Theorem 4.1: Unbiasedness
  • Theorem 4.2: Variance bounds
  • proof
  • ...and 5 more
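The inner-product preservation property named in Lemma 4.2 can be illustrated with a small sketch. The code below applies a standard CountSketch (one random bucket and one random sign per coordinate) to two plain vectors and checks empirically that the sketched inner product averages to the exact one. The paper applies the sketch to parameter Jacobians. The vectors, dimensions, and helper names here are hypothetical, chosen only for illustration.

```python
import numpy as np

def make_countsketch(dim, sketch_size, rng):
    """Draw the hash buckets and random signs that define one CountSketch."""
    buckets = rng.integers(0, sketch_size, size=dim)
    signs = rng.choice(np.array([-1.0, 1.0]), size=dim)
    return buckets, signs

def apply_countsketch(v, buckets, signs, sketch_size):
    """Compress v to length sketch_size by summing signed entries per bucket."""
    sketched = np.zeros(sketch_size)
    np.add.at(sketched, buckets, signs * v)  # unbuffered add handles repeated buckets
    return sketched

rng = np.random.default_rng(0)
dim, sketch_size, trials = 2000, 256, 400
u = rng.standard_normal(dim)
v = u + 0.5 * rng.standard_normal(dim)  # correlated with u so <u, v> is far from 0
exact = float(u @ v)

estimates = np.empty(trials)
for t in range(trials):
    # The same buckets and signs must be used for both vectors.
    buckets, signs = make_countsketch(dim, sketch_size, rng)
    su = apply_countsketch(u, buckets, signs, sketch_size)
    sv = apply_countsketch(v, buckets, signs, sketch_size)
    estimates[t] = su @ sv

print("exact inner product:", exact)
print("mean of sketched estimates:", estimates.mean())
print("std of a single estimate:", estimates.std())
```

A single sketched estimate is noisy, and its spread shrinks as `sketch_size` grows. Averaging over independent sketches recovers the exact value in expectation, which is the sense in which the sketch is unbiased for inner products.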