Statistically guided deep learning
Michael Kohler, Adam Krzyzak
TL;DR
This work tackles nonparametric regression with deep networks by developing a theory-guided, over-parameterized architecture that combines a parallel ensemble of depth-$L$ networks with a linear readout. It introduces data-driven initializations and a principled, adaptive scheme for selecting the learning rate and number of gradient steps, yielding provable $L_2$ convergence rates that match the minimax rate for $(p,C)$-smooth regression up to an arbitrarily small $\epsilon > 0$. The main contributions include a general error bound decomposing into approximation and estimation terms, a practical algorithm for tuning hyperparameters, and empirical evidence showing favorable finite-sample performance on simulated univariate data, often rivaling smoothing splines. The results demonstrate that theoretical analysis can guide the design of deep-learning estimators with improved finite-sample behavior, potentially extending to higher dimensions and more complex function classes.
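To make the rate statement concrete, the block below writes out the standard definition of $(p,C)$-smoothness and the classical minimax rate for this class. The symbols $m$ (regression function), $m_n$ (estimate), $\mathbf{P}_X$ (distribution of the input), $d$ (input dimension), $q$, $s$ and the constant $c$ are notation assumed here, not taken from the summary above. Reading "up to $\epsilon$" as a loss of $\epsilon$ in the exponent is also an assumption.

```latex
% (p,C)-smoothness: write p = q + s with q a nonnegative integer and 0 < s <= 1.
% m is (p,C)-smooth if every partial derivative of order q exists and satisfies
\[
  \left| \frac{\partial^q m}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(x)
       - \frac{\partial^q m}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(z) \right|
  \le C \, \| x - z \|^{s}
  \qquad \text{for all } x, z \in \mathbb{R}^d, \quad \alpha_1 + \dots + \alpha_d = q .
\]
% Classical minimax rate for the expected L_2 error over this class:
\[
  n^{-\frac{2p}{2p+d}} .
\]
% Under the assumed reading, a bound of the announced type would have the form
\[
  \mathbf{E} \int | m_n(x) - m(x) |^2 \, \mathbf{P}_X(dx)
  \le c \cdot n^{-\frac{2p}{2p+d} + \epsilon} .
\]
```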
Abstract
We present a theoretically well-founded deep learning algorithm for nonparametric regression. It uses over-parametrized deep neural networks with logistic activation function, which are fitted to the given data via gradient descent. We propose a special topology of these networks, a special random initialization of the weights, and a data-dependent choice of the learning rate and the number of gradient descent steps. We prove a theoretical bound on the expected $L_2$ error of this estimate, and illustrate its finite sample size performance by applying it to simulated data. Our results show that a theoretical analysis of deep learning which takes into account simultaneously optimization, generalization and approximation can result in a new deep learning estimate which has an improved finite sample performance.
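The ingredients named in the abstract can be illustrated with a minimal NumPy sketch: several logistic networks computed in parallel, a linear combination of their outputs, and plain gradient descent on the empirical $L_2$ risk. This is not the authors' algorithm. The uniform initialization on $[-3,3]$, the zero start for the readout weights, the holdout split used to pick the learning rate and step count, and all names (`init_params`, `fit_with_holdout`, `K`, `L`, `r`) are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    # Logistic activation; clipping avoids overflow in exp for large |z|.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50.0, 50.0)))

def init_params(rng, d, K, L, r):
    # K parallel sub-networks, each with L logistic layers; the last layer has width 1.
    sizes = [d] + [r] * (L - 1) + [1]
    Ws = [rng.uniform(-3, 3, (K, sizes[i], sizes[i + 1])) for i in range(L)]
    bs = [rng.uniform(-3, 3, (K, 1, sizes[i + 1])) for i in range(L)]
    w = np.zeros(K)  # linear readout weights (zero start is an assumption)
    return Ws, bs, w

def forward(params, X):
    Ws, bs, w = params
    a = np.broadcast_to(X, (w.shape[0],) + X.shape)  # (K, n, d): same input for every sub-network
    acts = [a]
    for W, b in zip(Ws, bs):
        a = sigmoid(a @ W + b)  # batched over the K sub-networks
        acts.append(a)
    return a[:, :, 0].T @ w, acts  # prediction of shape (n,), cached activations

def gradients(params, X, y):
    # Manual backpropagation for the mean squared error.
    Ws, bs, w = params
    pred, acts = forward(params, X)
    res = 2.0 * (pred - y) / len(y)                  # d(MSE)/d(pred)
    g_w = acts[-1][:, :, 0] @ res                    # gradient for the readout, shape (K,)
    delta = (w[:, None] * res[None, :])[:, :, None]  # d(MSE)/d(sub-network output), (K, n, 1)
    g_Ws, g_bs = [None] * len(Ws), [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        dz = delta * acts[l + 1] * (1.0 - acts[l + 1])  # through the logistic function
        g_Ws[l] = np.swapaxes(acts[l], 1, 2) @ dz
        g_bs[l] = dz.sum(axis=1, keepdims=True)
        delta = dz @ np.swapaxes(Ws[l], 1, 2)
    return g_Ws, g_bs, g_w

def mse(params, X, y):
    return float(np.mean((forward(params, X)[0] - y) ** 2))

def fit_with_holdout(X_tr, y_tr, X_va, y_va, lrs, max_steps, K=10, L=2, r=4, seed=0):
    # Holdout stand-in for a data-dependent choice of learning rate and step count:
    # keep the (learning rate, step) pair with the smallest validation error.
    best = {"val_mse": np.inf}
    for lr in lrs:
        rng = np.random.default_rng(seed)  # same initialization for every learning rate
        Ws, bs, w = init_params(rng, X_tr.shape[1], K, L, r)
        for t in range(1, max_steps + 1):
            g_Ws, g_bs, g_w = gradients((Ws, bs, w), X_tr, y_tr)
            Ws = [W - lr * g for W, g in zip(Ws, g_Ws)]
            bs = [b - lr * g for b, g in zip(bs, g_bs)]
            w = w - lr * g_w
            val = mse((Ws, bs, w), X_va, y_va)
            if val < best["val_mse"]:
                best = {"val_mse": val, "lr": lr, "steps": t, "params": (Ws, bs, w)}
    return best
```

A call such as `fit_with_holdout(X[:60], y[:60], X[60:], y[60:], lrs=(0.005, 0.02), max_steps=200)` on univariate data returns the selected learning rate, the selected number of steps, and the corresponding parameters. Starting the readout at zero means the first gradient step moves only the readout weights, which keeps the early iterations close to a linear least-squares problem in the sub-network outputs.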
