On the rate of convergence of an over-parametrized deep neural network regression estimate learned by gradient descent

Michael Kohler

TL;DR

This work analyzes nonparametric regression with random design, showing that an over-parameterized deep neural network trained by gradient descent with logistic activation can attain near-minimax convergence for $(p,C)$-smooth regression functions. The authors introduce a topology of $K_n$ parallel depth-$L$, width-$r$ networks and decompose the error into optimization, approximation, and generalization components, proving that with suitable initialization and learning rates the expected $L_2$ error decays as ${c \, n^{-\frac{2p}{2p+d}+\epsilon}}$ for any $\epsilon>0$. A key technical contribution is showing that the network remains within a bounded-complexity function space during training and that a bounded-weight network can approximate the target function at rate $K^{-p}$, enabling a rigorous rate bound. The results advance theoretical understanding of gradient-descent-trained deep nets in nonparametric regression, highlighting why over-parameterization does not necessarily entail overfitting when complexity is controlled and learning is properly tuned.
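The topology described above (a linear combination of $K_n$ parallel fully connected subnetworks, each of depth $L$ and width $r$ with logistic activation) can be sketched roughly as follows. This is only an illustrative forward pass in plain Python: the function names, the uniform initialization on $[-1,1]$, and the weight layout are assumptions made for the example and are not taken from the paper, which specifies its own initialization and step sizes.

```python
import math
import random


def logistic(z):
    """Logistic squasher sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def init_subnetwork(d, depth, width, rng):
    """Draw weights for one subnetwork with `depth` hidden layers of `width` neurons.

    Hypothetical initialization (uniform on [-1, 1]); not the paper's scheme.
    Each neuron is stored as [bias, w_1, ..., w_fan_in].
    """
    sizes = [d] + [width] * depth
    hidden = [
        [[rng.uniform(-1.0, 1.0) for _ in range(fan_in + 1)] for _ in range(fan_out)]
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:])
    ]
    # Final linear neuron of the subnetwork: bias plus one weight per hidden unit.
    output = [rng.uniform(-1.0, 1.0) for _ in range(width + 1)]
    return hidden, output


def subnetwork_forward(params, x):
    """Evaluate one depth-L, width-r subnetwork at the input vector x."""
    hidden, output = params
    h = list(x)
    for layer in hidden:
        h = [logistic(row[0] + sum(w * v for w, v in zip(row[1:], h))) for row in layer]
    return output[0] + sum(w * v for w, v in zip(output[1:], h))


def parallel_forward(subnets, outer_weights, x):
    """Linear combination of the K parallel subnetworks."""
    return sum(c * subnetwork_forward(p, x) for c, p in zip(outer_weights, subnets))


if __name__ == "__main__":
    rng = random.Random(0)
    K, L, r, d = 5, 3, 4, 2  # illustrative sizes only
    subnets = [init_subnetwork(d, L, r, rng) for _ in range(K)]
    print(parallel_forward(subnets, [1.0 / K] * K, [0.1, 0.2]))
```

Because the logistic function maps into $(0,1)$, each subnetwork output is bounded by the sum of the absolute values of its output weights, which gives some intuition for why bounding the weights during training keeps the estimate inside a function class of controlled complexity.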

Abstract

Nonparametric regression with random design is considered. The $L_2$ error with integration with respect to the design measure is used as the error criterion. An over-parametrized deep neural network regression estimate with logistic activation function is defined, where all weights are learned by gradient descent. It is shown that the estimate achieves a nearly optimal rate of convergence if the regression function is $(p,C)$-smooth.
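To get a feel for the rate $c \, n^{-\frac{2p}{2p+d}+\epsilon}$ quoted in the summary, the exponent can be evaluated for a few choices of smoothness $p$ and dimension $d$. The helper names below and the default constant `c=1.0` are assumptions for illustration only; the source does not give the value of the constant.

```python
def rate_exponent(p, d, eps=0.0):
    """Exponent 2p/(2p+d) - eps of the rate n^{-(2p/(2p+d)) + eps}."""
    if p <= 0 or d < 1 or eps < 0:
        raise ValueError("need p > 0, d >= 1 and eps >= 0")
    return 2.0 * p / (2.0 * p + d) - eps


def error_bound(n, p, d, eps=0.0, c=1.0):
    """Value of c * n^{-(2p/(2p+d)) + eps}; c = 1.0 is a placeholder constant."""
    return c * n ** (-rate_exponent(p, d, eps))


if __name__ == "__main__":
    # Same smoothness, growing dimension: the exponent shrinks as d grows.
    for d in (1, 4, 40):
        print(d, rate_exponent(2, d), error_bound(10_000, 2, d))
```

For $p=2$ and $d=4$ the exponent is $1/2$, so the bound scales like $n^{-1/2}$ up to the arbitrarily small $\epsilon$; for $d=40$ it drops to $1/11$, illustrating how the rate deteriorates with the input dimension.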

Paper Structure

This paper contains 16 sections, 15 theorems, 262 equations.

Key Result

Theorem 1

Let $n \in \mathbb{N}$, let $(X,Y)$, $(X_1,Y_1)$, …, $(X_n,Y_n)$ be independent and identically distributed $\mathbb{R}^d \times \mathbb{R}$-valued random variables such that $\mathrm{supp}(X)$ is bounded and that holds for some $c_4>0$. Let $p,C>0$ where $p=q+\beta$ for some $q \in \mathbb{N}_0$ and $\beta \in (0,1]$ with $p \geq 1/2$, and assume that the regression function $m:\mathbb{R}^d \rightarrow

Theorems & Definitions (17)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Theorem 2
  • Lemma 6
  • ...and 7 more