A Random Matrix Theory Perspective on the Spectrum of Learned Features and Asymptotic Generalization Capabilities

Yatin Dandi, Luca Pesce, Hugo Cui, Florent Krzakala, Yue M. Lu, Bruno Loureiro

TL;DR

This work provides a random matrix analysis of how fully-connected two-layer neural networks adapt to the target function after a single, but aggressive, gradient descent step, and rigorously establishes the equivalence between the updated features and an isotropic spiked random feature model, in the limit of large batch size.

Abstract

A key property of neural networks is their capacity to adapt to data during training. Yet, our current mathematical understanding of feature learning and its relationship to generalization remains limited. In this work, we provide a random matrix analysis of how fully-connected two-layer neural networks adapt to the target function after a single, but aggressive, gradient descent step. We rigorously establish the equivalence between the updated features and an isotropic spiked random feature model, in the limit of large batch size. For the latter model, we derive a deterministic equivalent description of the feature empirical covariance matrix in terms of certain low-dimensional operators. This allows us to sharply characterize the impact of training on the asymptotic feature spectrum, and in particular, provides a theoretical grounding for how the tails of the feature spectrum change with training. The deterministic equivalent further yields the exact asymptotic generalization error, shedding light on the mechanisms behind its improvement in the presence of feature learning. Our result goes beyond standard random matrix ensembles, and therefore we believe it is of independent technical interest. Unlike previous work, our result holds in the challenging maximal learning rate regime, is fully rigorous, and allows for finitely supported second-layer initialization, which turns out to be crucial for studying the functional expressivity of the learned features. This provides a sharp description of the impact of feature learning on the generalization of two-layer neural networks, beyond the random features and lazy training regimes.

Paper Structure

This paper contains 41 sections, 37 theorems, 320 equations, 2 figures.

Key Result

Lemma 3.1

Let $W^{(1)} \in \mathbb{R}^{p \times d}$ denote the weight matrix after the first gradient step. Then, under Assumptions ass:init, ass:highd, and ass:activation: where $w^\star$ is the target vector in eq. eq:def:data, and $u = \eta c_1c^\star_1 a^0/\sqrt{p}$ as defined in eq. eq:main:rank_one_approx, with $\eta$ being the learning rate.
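
To make the objects in the lemma concrete, the following is a minimal NumPy sketch of a rank-one ("spiked") perturbation $u (w^\star)^\top$ added to an initial weight matrix, with $u = \eta c_1 c^\star_1 a^0/\sqrt{p}$ as in the lemma. The additive form is what the label eq:main:rank_one_approx suggests rather than a restatement of the lemma. The Gaussian initialization, the values of `eta` and `c1_star`, and the function name `spiked_weights` are assumptions made for illustration only; `c1 = 0.5` is the first Hermite coefficient of ReLU, and the second layer `a0` is drawn from the two-value vocabulary used for $k = 2$ in Figure 2.

```python
import numpy as np

def spiked_weights(W0, a0, w_star, eta, c1, c1_star):
    """Return (u, W0 + u w_star^T) with u = eta * c1 * c1_star * a0 / sqrt(p).

    Illustrative only: the exact statement and scalings are those of Lemma 3.1.
    """
    p = W0.shape[0]
    u = eta * c1 * c1_star * a0 / np.sqrt(p)
    return u, W0 + np.outer(u, w_star)

rng = np.random.default_rng(0)
p, d = 64, 32

# Assumed initialization: i.i.d. Gaussian entries with variance 1/d.
W0 = rng.standard_normal((p, d)) / np.sqrt(d)

# Finitely supported second layer: vocabulary {1, -1} with probabilities {0.9, 0.1}.
a0 = rng.choice([1.0, -1.0], size=p, p=[0.9, 0.1])

# Unit-norm target direction (assumed normalization).
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)

# eta and c1_star are hypothetical values chosen for the illustration.
u, W1 = spiked_weights(W0, a0, w_star, eta=5.0, c1=0.5, c1_star=0.6)

spike = W1 - W0
rank = np.linalg.matrix_rank(spike)                   # the perturbation has rank one
top_sv = np.linalg.svd(spike, compute_uv=False)[0]    # equals ||u|| * ||w_star||
print(rank, top_sv)
```

In this sketch the single nonzero singular value of the perturbation is $\|u\| \, \|w^\star\|$, so the strength of the spike is controlled by the learning rate and by the second-layer values $a^0$.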

Figures (2)

  • Figure 1: Bulk spectrum of the empirical features covariance at initialization (dashed blue) and after training (green); the red line corresponds to the theoretical characterization derived in this manuscript.
  • Figure 2: Increased fitting accuracy through second-layer variability: Illustration of the benefits of larger support for the $2^{\rm nd}$ layer values, with $\sigma = \mathrm{ReLU}, \sigma_\star = \mathrm{tanh}$. Theoretical (continuous lines) and numerical (dots) predictions for the generalization error as a function of the number of samples per dimension $\alpha$ for different values of the second layer vocabulary size $k \in \{1,2,4\}$. The numerical simulations are averaged over 5 seeds with fixed hyper-parameters $\lambda = 0.01, \gamma = 0.5, \beta = 1.5, p = 2048$. Note the significant drop in the generalization error for $k>1$. The choices of the probabilities ${\bf \pi} = \{\pi_q\}_{q \in [k]}$ and the vocabulary ${\bf \zeta} = \{\zeta_q\}_{q \in [k] }$ for the numerical illustration are: a) ${\bf k = 1}: {\bf \pi} = \{1\}, {\bf{\zeta}} = \{1\}$; b) ${\bf k = 2}: {\bf \pi} = \{0.9,0.1\}, {\bf{\zeta}} = \{1,-1\}$; c) ${\bf k = 4}: {\bf \pi} = \{0.7,0.1,0.1,0.1\}, {\bf{\zeta}} = \{1,-0.5,1.5,-2\}$.
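
The spectrum shown in Figure 1 is that of the empirical features covariance. A rough sketch of how such a spectrum could be computed numerically is given below. It assumes Gaussian inputs, a Gaussian first layer, a ReLU activation, and the uncentered second-moment matrix $F^\top F / n$ as the covariance; these choices, the problem sizes, and the helper name `feature_spectrum` are illustrative assumptions and may differ from the normalization used in the paper.

```python
import numpy as np

def feature_spectrum(W, X, activation):
    """Eigenvalues (ascending) of the p x p matrix F^T F / n, with F = activation(X W^T)."""
    n = X.shape[0]
    F = activation(X @ W.T)        # n x p feature matrix
    C = F.T @ F / n                # symmetric positive semi-definite
    return np.linalg.eigvalsh(C), F

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
n, d, p = 400, 50, 100             # hypothetical sizes for the illustration

X = rng.standard_normal((n, d))                 # assumed Gaussian inputs
W0 = rng.standard_normal((p, d)) / np.sqrt(d)   # assumed initialization

eigs_init, F = feature_spectrum(W0, X, relu)

# A histogram of eigs_init (e.g. with np.histogram) gives the bulk at initialization;
# repeating the computation with updated weights gives the spectrum after training.
counts, edges = np.histogram(eigs_init, bins=30)
```

Passing the weights after the gradient step in place of `W0` yields the post-training spectrum, which can then be compared against the initialization histogram as in Figure 1.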

Theorems & Definitions (69)

  • Lemma 3.1
  • Definition 4.1: Extended resolvent
  • Definition 4.2: Shifted Hermite coefficient
  • Definition 4.4
  • Definition 4.5: Deterministic equivalent
  • Theorem 4.6: Deterministic equivalent
  • Corollary 4.7: Stieltjes transform
  • Theorem 4.8: Generalization Error
  • Definition B.1
  • Definition B.2: Hermite Expansion
  • ...and 59 more