Analysis of the rate of convergence of an over-parametrized convolutional neural network image classifier learned by gradient descent
Michael Kohler, Adam Krzyzak, Benjamin Walter
TL;DR
This work analyzes the rate of convergence of an over-parametrized convolutional neural network image classifier trained by gradient descent in a binary classification setting. It proves a dimension-free bound on the excess misclassification risk under an average-pooling posterior model with parameter $\kappa$: when the network and training parameters are chosen appropriately, ${\mathbf P}\{f_n(\mathbf{X}) \neq Y\} - {\mathbf P}\{f^*(\mathbf{X}) \neq Y\} \le c_6 \cdot n^{- \frac{1}{2\kappa^2+2} + \epsilon}$ for any $\epsilon>0$. A truncated estimator of the posterior achieves near-minimax rates. The analysis leverages metric-entropy bounds and adapts techniques from prior work on over-parametrized networks to a convolutional architecture with parallel CNNs and a linear output layer. Overall, the results offer a theoretical explanation for the empirical success of heavily over-parametrized CNNs trained by gradient descent, since the convergence rate does not degrade as the image dimension grows.
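To make the bound concrete, the following minimal sketch evaluates the exponent $\frac{1}{2\kappa^2+2}$ and the factor $n^{-\frac{1}{2\kappa^2+2}+\epsilon}$ for given values. The function names `rate_exponent` and `rate_factor` are hypothetical and chosen only for this illustration; the constant $c_6$ is not specified in the summary and is therefore left out.

```python
def rate_exponent(kappa: float) -> float:
    """Exponent 1 / (2 * kappa^2 + 2) from the stated bound (epsilon slack dropped)."""
    return 1.0 / (2.0 * kappa**2 + 2.0)


def rate_factor(n: int, kappa: float, eps: float = 0.0) -> float:
    """Evaluate n^(-1/(2*kappa^2 + 2) + eps).

    The unknown constant c_6 is omitted, so this is only the n-dependent
    part of the bound, not the bound itself.
    """
    return float(n) ** (-rate_exponent(kappa) + eps)


if __name__ == "__main__":
    # The exponent depends on kappa only; the image dimensions never enter.
    for kappa in (1.0, 2.0, 3.0):
        print(kappa, rate_exponent(kappa), rate_factor(10_000, kappa))
```

For example, $\kappa = 1$ gives the exponent $1/4$ and $\kappa = 2$ gives $1/10$, so larger $\kappa$ means a slower rate, while the image size plays no role in the exponent.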
Abstract
Image classification based on over-parametrized convolutional neural networks with a global average-pooling layer is considered. The weights of the network are learned by gradient descent. A bound on the rate of convergence of the difference between the misclassification risk of the newly introduced convolutional neural network estimate and the minimal possible value is derived.
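As a rough illustration of the kind of architecture described here, the sketch below computes the forward pass of several parallel convolutional feature maps, each followed by a global average-pooling step, combined by a linear output layer. It is a simplified assumption-laden toy in pure Python, not the estimate defined in the paper: the single convolutional layer, the ReLU activation, and all names (`conv2d_valid`, `global_average_pool`, `cnn_output`) are hypothetical, and the gradient descent training step is omitted.

```python
def conv2d_valid(image, kernel):
    """Slide a square kernel over a 2D image (no padding, stride 1)."""
    h, w, k = len(image), len(image[0]), len(kernel)
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            s = 0.0
            for a in range(k):
                for b in range(k):
                    s += kernel[a][b] * image[i + a][j + b]
            row.append(s)
        out.append(row)
    return out


def relu_map(fmap):
    """Apply ReLU elementwise (activation choice is an assumption)."""
    return [[max(0.0, v) for v in row] for row in fmap]


def global_average_pool(fmap):
    """Average all entries of a feature map into a single number."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)


def cnn_output(image, kernels, out_weights):
    """Linear combination of the pooled outputs of parallel conv branches."""
    pooled = [global_average_pool(relu_map(conv2d_valid(image, k))) for k in kernels]
    return sum(w * p for w, p in zip(out_weights, pooled))


if __name__ == "__main__":
    image = [[1.0] * 3 for _ in range(3)]
    kernels = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
    print(cnn_output(image, kernels, [0.5, 1.0]))
```

Because the pooling step averages over all positions, the number of output-layer weights in this toy depends on the number of branches and not on the image size.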
