On the rate of convergence of an over-parametrized deep neural network regression estimate learned by gradient descent
Michael Kohler
TL;DR
This work analyzes nonparametric regression with random design, showing that an over-parametrized deep neural network with logistic activation, trained by gradient descent, can attain a nearly minimax-optimal rate of convergence for $(p,C)$-smooth regression functions. The paper introduces a topology of $K_n$ parallel depth-$L$, width-$r$ networks and decomposes the error into optimization, approximation, and generalization components, proving that with suitable initialization and learning rates the expected $L_2$ error is bounded by $c \, n^{-\frac{2p}{2p+d}+\epsilon}$ for any $\epsilon>0$. A key technical contribution is showing that the network remains within a function space of bounded complexity during training and that a network with bounded weights can approximate the target function at rate $K^{-p}$, which together yield a rigorous rate bound. The results advance the theoretical understanding of deep networks trained by gradient descent in nonparametric regression, indicating why over-parametrization need not lead to overfitting when complexity is controlled and the learning rate is properly tuned.
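To make the topology described above concrete, the following is a minimal sketch of the forward pass of a sum of $K$ parallel fully connected networks of depth $L$ and width $r$ with logistic activation. It is an illustration under stated assumptions, not the construction from the paper: the parameter names (`W_in`, `W_hid`, `w_sub`, `w_out`), the linear output unit of each sub-network, and the initialization (standard normal inner weights, zero outer weights) are all hypothetical choices, and training by gradient descent is omitted.

```python
import numpy as np

def logistic(z):
    """Logistic squasher sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def init_params(K, L, r, d, rng):
    """Illustrative initialization (an assumption, not the paper's scheme):
    standard normal inner weights, zero outer weights."""
    return {
        "W_in": rng.normal(size=(K, r, d)),          # input -> first hidden layer
        "b_in": rng.normal(size=(K, r)),
        "W_hid": rng.normal(size=(K, L - 1, r, r)),  # hidden -> hidden, L - 1 times
        "b_hid": rng.normal(size=(K, L - 1, r)),
        "w_sub": rng.normal(size=(K, r)),            # last hidden layer -> sub-network output
        "b_sub": rng.normal(size=(K,)),
        "w_out": np.zeros(K),                        # outer weights combining the K sub-networks
    }

def forward(params, x):
    """Evaluate the sum of K parallel networks at inputs x of shape (n, d)."""
    # First hidden layer of every sub-network at once: shape (n, K, r).
    h = logistic(np.einsum("krd,nd->nkr", params["W_in"], x) + params["b_in"])
    # Remaining L - 1 hidden layers, each sub-network using its own weights.
    for l in range(params["W_hid"].shape[1]):
        pre = np.einsum("krs,nks->nkr", params["W_hid"][:, l], h) + params["b_hid"][:, l]
        h = logistic(pre)
    # Scalar output of each sub-network: shape (n, K).
    sub = np.einsum("kr,nkr->nk", params["w_sub"], h) + params["b_sub"]
    # Linear combination of the K sub-network outputs: shape (n,).
    return sub @ params["w_out"]

rng = np.random.default_rng(0)
params = init_params(K=4, L=3, r=5, d=2, rng=rng)
y = forward(params, rng.uniform(size=(7, 2)))
```

In this sketch the sub-networks share no weights, so the total number of parameters grows linearly in $K$ while each sub-network keeps fixed depth and width; with the outer weights set to zero, the initial estimate is identically zero.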
Abstract
Nonparametric regression with random design is considered. The $L_2$ error with integration with respect to the design measure is used as the error criterion. An over-parametrized deep neural network regression estimate with logistic activation function is defined, where all weights are learned by gradient descent. It is shown that the estimate achieves a nearly optimal rate of convergence when the regression function is $(p,C)$-smooth.
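For reference, the two notions used in the abstract can be written out as follows. This is the standard formulation in the nonparametric regression literature, given here as an assumption about the intended notation; the symbols $m$ (regression function), $m_n$ (estimate), $\mathbf{P}_X$ (design measure), $q$ and $s$ do not appear in the text above. The $L_2$ error of an estimate $m_n$ is

$$
\int |m_n(x) - m(x)|^2 \, \mathbf{P}_X(dx).
$$

Writing $p = q + s$ with $q \in \mathbb{N}_0$ and $0 < s \le 1$, a function $m : \mathbb{R}^d \to \mathbb{R}$ is called $(p,C)$-smooth if, for every multi-index $(\alpha_1, \dots, \alpha_d) \in \mathbb{N}_0^d$ with $\sum_{j=1}^d \alpha_j = q$, the partial derivative $\partial^q m / (\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d})$ exists and satisfies

$$
\left| \frac{\partial^q m}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(x) - \frac{\partial^q m}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(z) \right| \le C \, \|x - z\|^s \quad \text{for all } x, z \in \mathbb{R}^d.
$$

For this class the optimal minimax rate of convergence of the expected $L_2$ error is $n^{-\frac{2p}{2p+d}}$, which is the benchmark behind the phrase "nearly optimal".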
