Optimal Rates for Generalization of Gradient Descent for Deep ReLU Classification

Yuanfan Li, Yunwen Lei, Zheng-Chu Guo, Yiming Ying

TL;DR

This work establishes near-optimal generalization rates for deep ReLU networks trained by gradient descent (GD) on NTK-separable data with margin $\gamma$, achieving a risk bound of $\widetilde{O}\Big(\dfrac{L^4(1+\gamma L^2)}{\gamma^2 n}\Big)$. It introduces two key innovations: (i) refined control of activation patterns near a reference model, which sharpens the Rademacher complexity bound, and (ii) a covering-number-based approach that yields a polynomial-in-depth Lipschitz bound near initialization. Together, these give a generalization guarantee that matches kernel-type rates while relaxing overparameterization to a polynomial scale in depth $L$ and polylogarithmic width. The resulting bound matches the optimal $\widetilde{O}(1/(n\gamma^2))$ risk up to factors that are only polynomial in depth, and experiments on 2-XOR data corroborate the predicted scaling. This work advances the theoretical understanding of GD in deep ReLU networks and informs practical choices of width and depth in overparameterized regimes.

Abstract

Recent advances have significantly improved our understanding of the generalization performance of gradient descent (GD) methods in deep neural networks. A natural and fundamental question is whether GD can achieve generalization rates comparable to the minimax optimal rates established in the kernel setting. Existing results either yield suboptimal rates of $O(1/\sqrt{n})$, or focus on networks with smooth activation functions, incurring exponential dependence on network depth $L$. In this work, we establish optimal generalization rates for GD with deep ReLU networks by carefully trading off optimization and generalization errors, achieving only polynomial dependence on depth. Specifically, under the assumption that the data are NTK-separable with margin $\gamma$, we prove an excess risk rate of $\widetilde{O}(L^4 (1 + \gamma L^2) / (n \gamma^2))$, which aligns with the optimal SVM-type rate $\widetilde{O}(1 / (n \gamma^2))$ up to depth-dependent factors. A key technical contribution is our novel control of activation patterns near a reference model, enabling a sharper Rademacher complexity bound for deep ReLU networks trained with gradient descent.
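
To make the comparison with the SVM-type rate explicit, the excess risk rate stated in the abstract can be factored and expanded. This is only an algebraic restatement of that rate, not an additional result:

$$\frac{L^4(1+\gamma L^2)}{n\gamma^2} \;=\; \frac{1}{n\gamma^2}\cdot L^4(1+\gamma L^2) \;=\; \frac{L^4}{n\gamma^2} + \frac{L^6}{n\gamma}.$$

The first term is the larger of the two when $\gamma L^2 \le 1$ and the second when $\gamma L^2 \ge 1$. For fixed depth $L$, the factor $L^4(1+\gamma L^2)$ does not depend on $n$, so the dependence on the sample size is $1/n$ in both regimes.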

Paper Structure

This paper contains 20 sections, 25 theorems, 181 equations, 3 tables.

Key Result

Theorem 1

Let Assumptions ass:init and ass:input hold. If $m\gtrsim L^{16}(\log m)^4\log(nL/\delta)F^4_S(\overline{\mathbf{W}})$ and $\eta \leq \min\{4/(5L),\,1/(20L\tilde{F}_S(\overline{\mathbf{W}}))\}$, then with probability at least $1-\delta$, for all $t\le T$ we have
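
The step-size condition in Theorem 1 can be evaluated numerically once the depth $L$ and the quantity $\tilde{F}_S(\overline{\mathbf{W}})$ are known. The sketch below is a minimal illustration: the function name `step_size_cap` and the argument `f_tilde` (standing in for $\tilde{F}_S(\overline{\mathbf{W}})$, which is not defined in this summary) are hypothetical, and the values passed in are arbitrary.

```python
def step_size_cap(depth, f_tilde):
    """Return min{4/(5L), 1/(20 L F)}, the upper bound on eta in Theorem 1.

    depth   -- network depth L (positive)
    f_tilde -- hypothetical stand-in for the quantity written
               \\tilde{F}_S(\\overline{W}) in the theorem (positive)
    """
    if depth <= 0 or f_tilde <= 0:
        raise ValueError("depth and f_tilde must be positive")
    first_term = 4.0 / (5.0 * depth)
    second_term = 1.0 / (20.0 * depth * f_tilde)
    return min(first_term, second_term)


# Arbitrary illustrative inputs, not values from the paper.
cap_large_f = step_size_cap(depth=4, f_tilde=1.0)
cap_small_f = step_size_cap(depth=4, f_tilde=0.01)
```

Comparing the two terms shows that $4/(5L) \le 1/(20L\tilde{F})$ exactly when $\tilde{F} \le 1/16$, so which term is binding does not depend on the depth.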

Theorems & Definitions (53)

  • Definition 1: Gradient Descent
  • Theorem 1
  • Remark 1
  • Definition 2: Rademacher complexity
  • Remark 2: Improved Rademacher complexity
  • Lemma 1: srebro2010smoothness
  • Theorem 2
  • Remark 3: Analysis of Lipschitzness
  • Theorem 3
  • Remark 4: Proof sketch
  • ...and 43 more
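
As a concrete illustration of the quantity named in Definition 2, the sketch below gives a Monte Carlo estimate of the empirical Rademacher complexity of a finite function class, using the standard textbook form $\mathbb{E}_{\sigma}\big[\max_{f}\frac{1}{n}\sum_{i}\sigma_i f(x_i)\big]$. The paper's own definition is not reproduced in this summary and may differ in normalization. The function name `empirical_rademacher` and the toy class are assumptions made for illustration only.

```python
import random


def empirical_rademacher(function_values, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[max_f (1/n) * sum_i sigma_i * f(x_i)].

    function_values[k][i] holds f_k(x_i) for a finite (hypothetical)
    function class evaluated on a fixed sample of n points.
    """
    rng = random.Random(seed)
    n = len(function_values[0])
    total = 0.0
    for _ in range(num_draws):
        # Draw independent Rademacher signs, each +1 or -1 with probability 1/2.
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        # Maximum correlation between the signs and any function in the class.
        total += max(
            sum(s * v for s, v in zip(sigma, values)) / n
            for values in function_values
        )
    return total / num_draws


# Toy class: two sign patterns and their negations on n = 4 points.
toy_class = [
    [1.0, 1.0, -1.0, -1.0],
    [1.0, -1.0, 1.0, -1.0],
    [-1.0, -1.0, 1.0, 1.0],
    [-1.0, 1.0, -1.0, 1.0],
]
estimate = empirical_rademacher(toy_class)
```

Because the toy class is closed under negation and its values lie in $[-1, 1]$, every draw contributes a value between 0 and 1, so the estimate does too. A class containing only the zero function gives exactly 0.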