How Does Gradient Descent Learn Features -- A Local Analysis for Regularized Two-Layer Neural Networks

Mo Zhou; Rong Ge

TL;DR

This paper considers another mechanism for feature learning via gradient descent through a local convergence analysis and demonstrates that, once the loss is below a certain threshold, gradient descent with a carefully regularized objective will capture ground-truth directions.

Abstract

The ability to learn useful features is one of the major advantages of neural networks. Although recent works show that neural networks can operate in a neural tangent kernel (NTK) regime that does not allow feature learning, many works also demonstrate the potential for neural networks to go beyond the NTK regime and perform feature learning. Recently, a line of work highlighted the feature learning capabilities of the early stages of gradient-based training. In this paper we consider another mechanism for feature learning via gradient descent through a local convergence analysis. We show that once the loss is below a certain threshold, gradient descent with a carefully regularized objective will capture ground-truth directions. We further strengthen this local convergence analysis by incorporating early-stage feature learning analysis. Our results demonstrate that feature learning not only happens at the initial gradient steps, but can also occur towards the end of training.

Paper Structure

This paper contains 75 sections, 70 theorems, 198 equations, 3 figures, 1 algorithm.

Introduction
Related works
Neural Tangent Kernel
Early stage feature learning
Learning single/multi-index models with neural networks
Local loss landscape
Preliminary
Notation
Teacher-student setup
Loss and algorithm
Main results
Proof overview
Stage 1
Stage 2
Stage 3
...and 60 more sections

Key Result

Theorem 1.1

If the data is generated by a 2-layer teacher network $f_*$, as long as the width of the student network $m$ is at least some quantity $m_0$ that only depends on $f_*$, a variant of the gradient descent algorithm (Algorithm 1, roughly gradient descent with decreasing weight decay) can recover the ta
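The teacher-student setting in this theorem can be illustrated with a minimal NumPy sketch. It is not the paper's Algorithm 1, whose details are not reproduced here. It assumes a ReLU activation, Gaussian inputs, a squared loss with an $\ell_2$ penalty on both layers, and a hypothetical weight-decay schedule `lam0 / (1 + t / 200)`. The names `make_teacher_data`, `train_student`, and `max_alignment`, and all constants, are illustrative choices rather than quantities from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def make_teacher_data(n=512, d=5, k=3, seed=1):
    """Sample Gaussian inputs and label them with a k-neuron ReLU teacher (assumed form)."""
    rng = np.random.default_rng(seed)
    W_star = rng.normal(size=(k, d))
    W_star /= np.linalg.norm(W_star, axis=1, keepdims=True)  # unit-norm teacher directions
    X = rng.normal(size=(n, d))
    y = relu(X @ W_star.T).sum(axis=1)  # teacher output weights fixed to 1 for simplicity
    return X, y, W_star

def train_student(X, y, m=32, steps=1000, lr=0.02, lam0=1e-2, seed=0):
    """Full-batch gradient descent on a width-m student with decreasing weight decay."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(m, d)) / np.sqrt(d)  # first-layer weights, one row per neuron
    a = rng.normal(size=m) / np.sqrt(m)       # second-layer weights
    losses = []
    for t in range(steps):
        lam = lam0 / (1.0 + t / 200.0)        # hypothetical decreasing schedule
        H = X @ W.T                           # pre-activations, shape (n, m)
        act = relu(H)
        resid = act @ a - y                   # prediction error, shape (n,)
        losses.append(0.5 * np.mean(resid ** 2))  # unregularized squared loss
        # Gradients of 0.5 * mean(resid^2) + 0.5 * lam * (|a|^2 + |W|^2).
        grad_a = act.T @ resid / n + lam * a
        grad_W = ((resid[:, None] * (H > 0)) * a).T @ X / n + lam * W
        a -= lr * grad_a
        W -= lr * grad_W
    return W, a, losses

def max_alignment(W, W_star):
    """For each teacher direction, the largest cosine similarity with any student neuron."""
    Wn = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
    return (Wn @ W_star.T).max(axis=0)
```

Calling `train_student` on the output of `make_teacher_data` and then `max_alignment(W, W_star)` gives a rough view of whether student neurons point toward the teacher's directions, which is the sense of "capturing ground-truth directions" used above. How close the alignment gets in this toy run depends on the assumed constants and says nothing about the theorem's guarantees.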

Figures (3)

Figure 1: Illustration of descent direction
Figure 2: Dual certificate $\eta$.
Figure 3: Test function $g$.

Theorems & Definitions (109)

Theorem 1.1: Informal
Theorem 3.1: Main result
Lemma 4.0: Stage 1
Lemma 4.0: Stage 2
Lemma 4.0: Stage 3
Lemma 4.0: Gradient lower bound
Lemma 5.1: Feature improvement descent direction, informal
Lemma 6.1: Informal
Definition 1: Non-degenerate dual certificate
Lemma 6.2
...and 99 more
