A Theory of Non-Linear Feature Learning with One Gradient Step in Two-Layer Neural Networks

Behrad Moniri, Donghwan Lee, Hamed Hassani, Edgar Dobriban

TL;DR

The paper addresses the inability of one-step feature learning in two-layer networks with fixed learning rates to capture nonlinear components of the target. By letting the gradient step size scale as ${\eta \asymp n^{\alpha}}$ with ${\alpha\in(0,1/2)}$, it shows the emergence of ${\ell}$ spectral spikes in the updated feature matrix, each corresponding to polynomial features of degree ${1,\ldots,\ell}$. The left singular vectors associated with these spikes align with monomial features ${(\tilde{\mathbf{X}} \boldsymbol{\beta})^{\circ k}}$, enabling learning of higher-degree components via ridge regression, and the authors provide equivalence theorems to rigorously characterize training and test errors through Gaussian equivalence. They prove that for ${\ell=1}$ nonlinear features are not learned, while for ${\ell=2}$ the network can learn quadratic components, with implications for when nonlinear representations improve generalization in high-dimensional settings.

Abstract

Feature learning is thought to be one of the fundamental reasons for the success of deep neural networks. It is rigorously known that in two-layer fully-connected neural networks under certain conditions, one step of gradient descent on the first layer can lead to feature learning, characterized by the appearance of a separated rank-one component -- spike -- in the spectrum of the feature matrix. However, with a constant gradient descent step size, this spike only carries information from the linear component of the target function and therefore learning non-linear components is impossible. We show that with a learning rate that grows with the sample size, such training in fact introduces multiple rank-one components, each corresponding to a specific polynomial feature. We further prove that the limiting large-dimensional and large-sample training and test errors of the updated neural networks are fully characterized by these spikes. By precisely analyzing the improvement in the training and test errors, we demonstrate that these non-linear features can enhance learning.

Paper Structure (108 sections, 22 theorems, 282 equations, 3 figures)


Key Result

Theorem 3.1

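Let $\eta \asymp n^\alpha$ with $\frac{\ell - 1}{2\ell} < \alpha < \frac{\ell}{2\ell + 2}$ for some $\ell \in \mathbb{N}$. If Conditions cond:limit-cond:tehe hold, then, for $c_k$ from Condition cond:he and $\mathbf{F}_0 = \sigma(\tilde{\mathbf{X}} \mathbf{W}_0^\top)$, the updated feature matrix $\mathbf{F} = \sigma(\tilde{\mathbf{X}} \mathbf{W}^\top)$ equals $\mathbf{F}_0$ plus $\ell$ rank-one spike terms, one per monomial feature $(\tilde{\mathbf{X}} \boldsymbol{\beta})^{\circ k}$ for $k = 1, \ldots, \ell$ and with coefficients involving $c_k$, plus a remainder $\mathbf{\Delta}$, where $\Vert \mathbf{\Delta} \Vert_{\textnormal{op}} = o(\sqrt{n})$ with probability $1 - o(1)$.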
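
The intervals in the theorem are adjacent: the upper endpoint $\frac{\ell}{2\ell+2}$ for $\ell$ equals the lower endpoint for $\ell + 1$, so they tile $(0, 1/2)$ apart from the shared boundary points. The sketch below only evaluates that interval arithmetic with exact fractions. The helper names `alpha_interval` and `spike_count` are hypothetical, chosen for illustration, and are not taken from the paper.

```python
from fractions import Fraction


def alpha_interval(ell):
    """Open interval ((ell-1)/(2*ell), ell/(2*ell+2)) from the theorem statement."""
    if ell < 1:
        raise ValueError("ell must be a positive integer")
    return Fraction(ell - 1, 2 * ell), Fraction(ell, 2 * ell + 2)


def spike_count(alpha, max_ell=1000):
    """Return the ell whose open interval contains alpha.

    Returns None when alpha is outside (0, 1/2), sits exactly on a boundary
    shared by two intervals, or would need ell > max_ell.
    """
    for ell in range(1, max_ell + 1):
        lo, hi = alpha_interval(ell)
        if lo < alpha < hi:
            return ell
        if alpha <= lo:
            # alpha is at or below this lower endpoint, so no later interval fits
            return None
    return None


if __name__ == "__main__":
    for ell in (1, 2, 3):
        print(ell, alpha_interval(ell))
    # Exponent used in the Figure 2 caption: eta = n^0.29
    print(spike_count(Fraction("0.29")))
```

For the exponent $0.29$ from the Figure 2 caption, the helper returns $\ell = 2$, since $1/4 < 0.29 < 1/3$. This agrees with the two isolated spikes reported there.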

Figures (3)

  • Figure 1: Spectrum of the updated feature matrix for different regimes of the gradient step size $\eta$. Spikes corresponding to monomial features are added to the spectrum of the initial matrix. The number of spikes depends on the range of $\alpha$. See Theorem 3.1 and Theorem \ref{thm:subspace} for details.
  • Figure 2: Histogram of the scaled singular values (divided by $\sqrt{n}$) of the feature matrix $\mathbf{F} = \sigma(\tilde{\mathbf{X}}\mathbf{W}^\top)$ after the update with step size $\eta = n^{0.29}$ ($\ell = 2$). In this regime, two isolated spikes appear in the spectrum, as stated in Theorem 3.1. The top two left singular vectors $\bm{u}_1$ and $\bm{u}_2$ are aligned with $\tilde{\mathbf{X}}\boldsymbol{\beta}$ and $(\tilde{\mathbf{X}}\boldsymbol{\beta})^{\circ 2}$, respectively. See Section \ref{sec:numerical} for the simulation details.
  • Figure 3: (Left, Middle) Training and test errors after one gradient step as functions of $\log(\eta) / \log(n)$. (Right) A toy plot illustrating the theoretical training/test error curve as a function of $\log(\eta) / \log(n)$.

Theorems & Definitions (27)

  • Theorem 3.1: Spectrum of feature matrix
  • Theorem 3.2
  • Theorem 4.1: Training loss equivalence
  • Theorem 4.2: Test error equivalence
  • Theorem 4.3
  • Corollary 4.4
  • Theorem 4.5
  • Lemma C.1: Orthogonality of Hermite polynomials
  • proof
  • Lemma C.2: Taylor expansion of Hermite polynomials
  • ...and 17 more