On the rate of convergence of an over-parametrized deep neural network regression estimate learned by gradient descent

Michael Kohler

TL;DR

This work analyzes nonparametric regression with random design and shows that an over-parametrized deep neural network with logistic activation function, trained by gradient descent, can attain a near-minimax rate of convergence for $(p,C)$-smooth regression functions. The author introduces a topology of $K_n$ parallel networks, each of depth $L$ and width $r$, and decomposes the error into optimization, approximation, and generalization components. With suitable initialization and learning rates, the expected $L_2$ error is shown to decay as $c \, n^{-\frac{2p}{2p+d}+\epsilon}$ for any $\epsilon>0$. A key technical contribution is showing that the network remains within a function space of bounded complexity during training and that a network with bounded weights can approximate the target function at rate $K^{-p}$, which together yield a rigorous rate bound. The results advance the theoretical understanding of deep networks trained by gradient descent in nonparametric regression: over-parametrization need not entail overfitting when complexity is controlled and the learning rate is properly tuned.
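To make the topology concrete, the following is a minimal NumPy sketch of a forward pass through $K$ parallel fully connected networks of depth $L$ and width $r$ with logistic activation, whose outputs are combined linearly. The function names (`init_params`, `forward`), the uniform initialization range, and the zero-initialized outer weights are illustrative assumptions, not the construction or initialization used in the paper.

```python
import numpy as np


def logistic(z):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + np.exp(-z))


def init_params(d, K, L, r, rng):
    """Hypothetical initialization for K parallel depth-L, width-r networks.

    Inner weights are drawn uniformly from [-1, 1] (an assumption);
    the outer combination weights start at zero (also an assumption).
    """
    params = []
    for _ in range(K):
        sizes = [d] + [r] * L
        layers = [
            (rng.uniform(-1.0, 1.0, (sizes[i + 1], sizes[i])),
             rng.uniform(-1.0, 1.0, sizes[i + 1]))
            for i in range(L)
        ]
        out_w = rng.uniform(-1.0, 1.0, r)  # last hidden layer -> scalar
        params.append((layers, out_w))
    outer = np.zeros(K)  # one combination weight per parallel network
    return params, outer


def forward(x, params, outer):
    """Evaluate the combined network on x of shape (n, d); returns shape (n,)."""
    total = np.zeros(x.shape[0])
    for (layers, out_w), c in zip(params, outer):
        h = x
        for W, b in layers:
            h = logistic(h @ W.T + b)  # one hidden layer of width r
        total += c * (h @ out_w)       # weighted output of this sub-network
    return total


rng = np.random.default_rng(0)
params, outer = init_params(d=3, K=4, L=2, r=5, rng=rng)
x = rng.uniform(-1.0, 1.0, (7, 3))
y_hat = forward(x, params, outer)
```

With zero outer weights the combined network starts as the zero function, so in this sketch the initial estimate depends only on what gradient descent later does to the weights.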

Abstract

Nonparametric regression with random design is considered. The $L_2$ error with integration with respect to the design measure is used as the error criterion. An over-parametrized deep neural network regression estimate with logistic activation function is defined, where all weights are learned by gradient descent. It is shown that the estimate achieves a nearly optimal rate of convergence when the regression function is $(p,C)$-smooth.
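For orientation, the error criterion and the rate can be written out as follows. The symbols $m$ (regression function), $m_n$ (estimate), and $\mathbf{P}_X$ (design measure) are notation assumed here for illustration; the rate itself is the one quoted in the TL;DR.

$$
\int |m_n(x) - m(x)|^2 \, \mathbf{P}_X(dx), \qquad m(x) = \mathbf{E}\{Y \mid X = x\},
$$

$$
\mathbf{E} \int |m_n(x) - m(x)|^2 \, \mathbf{P}_X(dx) \le c \, n^{-\frac{2p}{2p+d}+\epsilon} \quad \text{for any } \epsilon > 0 .
$$

The exponent differs from the classical minimax rate $n^{-\frac{2p}{2p+d}}$ for $(p,C)$-smooth regression functions only by the arbitrarily small $\epsilon$, which is the sense in which the rate is nearly optimal.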
