Stochastic Gradient Descent for Two-layer Neural Networks

Dinghao Cao; Zheng-Chu Guo; Lei Shi

TL;DR

This work analyzes the convergence of SGD for overparameterized two-layer neural networks in the NTK regime by embedding the training dynamics into the reproducing kernel Hilbert space (RKHS) associated with the neural tangent kernel (NTK). It pairs NTK-based kernel approximation with RKHS convergence analysis to derive sharp last-iterate convergence rates under polynomial width growth and standard smoothness/capacity assumptions. The main results yield rates of the form $\mathbb{E}[\|g_{\Theta^{(T+1)}}-g_\rho\|_\rho^2]\le C\log(2/\delta)\big( T^{-\frac{2r}{2r+1}} + \varepsilon \big)$ for suitable choices of $\theta$, $\lambda$, and polynomially large width $M$, with an arbitrarily small $\varepsilon>0$. A corollary shows capacity-independent rates approaching the minimax optimum when the eigenvalue decay is favorable. The analysis connects kernel methods and neural network optimization: last-iterate SGD can achieve near-optimal generalization in overparameterized settings with a width that grows polynomially rather than exponentially, which supports scalable training under NTK-inspired dynamics.
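
To make the rate concrete, the exponent $\frac{2r}{2r+1}$ can be evaluated at the ends of the range $\frac{1}{2}<r\le 1$ that Theorem 1 assumes for the regularity parameter. This is plain arithmetic on the bound displayed above, not an additional result from the paper:

$$
r = 1:\quad T^{-\frac{2r}{2r+1}} = T^{-\frac{2}{3}},
\qquad\qquad
r \downarrow \tfrac{1}{2}:\quad T^{-\frac{2r}{2r+1}} \to T^{-\frac{1}{2}}.
$$

Since $\frac{2r}{2r+1}$ is increasing in $r$, a more regular target gives a faster rate, and the leading term always lies between $T^{-1/2}$ and $T^{-2/3}$ on this range.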

Abstract

This paper presents a comprehensive study on the convergence rates of the stochastic gradient descent (SGD) algorithm when applied to overparameterized two-layer neural networks. Our approach combines the Neural Tangent Kernel (NTK) approximation with convergence analysis in the Reproducing Kernel Hilbert Space (RKHS) generated by NTK, aiming to provide a deep understanding of the convergence behavior of SGD in overparameterized two-layer neural networks. Our research framework enables us to explore the intricate interplay between kernel methods and optimization processes, shedding light on the optimization dynamics and convergence properties of neural networks. In this study, we establish sharp convergence rates for the last iterate of the SGD algorithm in overparameterized two-layer neural networks. Additionally, we have made significant advancements in relaxing the constraints on the number of neurons, which have been reduced from exponential dependence to polynomial dependence on the sample size or number of iterations. This improvement allows for more flexibility in the design and scaling of neural networks, and will deepen our theoretical understanding of neural network models trained with SGD.
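
The setting described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's algorithm: the ReLU activation, the $1/\sqrt{M}$ output scaling, the random-sign output layer, the step-size schedule $\eta_t = \eta_1 t^{-\theta}$, and the use of $\lambda$ as a ridge penalty pulling the parameters toward their initialization are all choices made here for illustration. The function names (`init_params`, `predict`, `step_size`, `sgd_step`, `run_sgd`) are hypothetical.

```python
import numpy as np

def init_params(d, M, rng):
    """Random initialization of a width-M two-layer network on R^d."""
    W = rng.standard_normal((M, d))        # hidden-layer weights
    a = rng.choice([-1.0, 1.0], size=M)    # output-layer weights (random signs)
    return W, a

def predict(W, a, x):
    """g(x) = M^{-1/2} * sum_j a_j * relu(<w_j, x>)  (ReLU is an assumption)."""
    return float(a @ np.maximum(W @ x, 0.0)) / np.sqrt(W.shape[0])

def step_size(t, eta1=0.5, theta=0.5):
    """Polynomially decaying step size eta_t = eta1 * t^(-theta), t >= 1."""
    return eta1 * t ** (-theta)

def sgd_step(W, a, W0, a0, x, y, eta, lam):
    """One SGD step on 0.5*(g(x)-y)^2 + 0.5*lam*(|W-W0|^2 + |a-a0|^2).

    The form of the lam-penalty is an assumption made for this sketch.
    """
    M = W.shape[0]
    pre = W @ x                            # pre-activations <w_j, x>
    act = np.maximum(pre, 0.0)             # ReLU activations
    resid = a @ act / np.sqrt(M) - y       # prediction error on this sample
    grad_a = resid * act / np.sqrt(M) + lam * (a - a0)
    grad_W = resid * np.outer(a * (pre > 0), x) / np.sqrt(M) + lam * (W - W0)
    return W - eta * grad_W, a - eta * grad_a

def run_sgd(X, Y, M=256, lam=1e-3, eta1=0.5, theta=0.5, seed=0):
    """Single pass over (X, Y), using one fresh sample per iteration."""
    rng = np.random.default_rng(seed)
    W0, a0 = init_params(X.shape[1], M, rng)
    W, a = W0.copy(), a0.copy()
    for t, (x, y) in enumerate(zip(X, Y), start=1):
        W, a = sgd_step(W, a, W0, a0, x, y, step_size(t, eta1, theta), lam)
    return W, a                            # last iterate, not an average
```

Returning the final parameters rather than a running average mirrors the last-iterate focus of the paper. The width `M` is a free argument here, whereas the paper's results concern how large it must be as a function of the number of iterations.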

Paper Structure (16 sections, 19 theorems, 115 equations, 2 algorithms)

Introduction
Preliminary
Two-layer Neural Networks
Stochastic Gradient Descent
Neural Tangent Kernel
Main Results and Discussion
Proof of Main Results
Error Decomposition
Estimation For Dynamics Error
Estimation For Convergence Error
Estimation For Random Feature Error
Estimation For Approximation Error
Proof of Main Results
Appendix
Useful Lemmas
...and 1 more section

Key Result

Theorem 1

Suppose the assumptions on the activation function, uniform boundedness, regularity (with $\frac{1}{2}<r\le 1$), and capacity (with $\beta>1$) hold. For any $\lambda > 0$ and $T \in \mathbb{N}_+$, run algorithm SGD with a polynomially decaying r… the smallest integer not less than $x\in \mathbb{R}$. Then there exists $M_0(T, \lambda) = \lceil \ldots$

Theorems & Definitions (30)

Theorem 1
Corollary 1
Proposition 4.1
Lemma 4.1
proof
Proposition 4.2
proof
Proposition 4.3
proof
Proposition 4.4
...and 20 more
