Analysis of the rate of convergence of an over-parametrized convolutional neural network image classifier learned by gradient descent

Michael Kohler; Adam Krzyzak; Benjamin Walter

TL;DR

This work analyzes the rate of convergence for an over-parameterized convolutional neural network image classifier trained by gradient descent in a binary classification setting. It proves a dimension-free bound on the excess misclassification risk under an average-pooling posterior model with parameter $\kappa$, showing ${\mathbf P}\{f_n(\mathbf{X}) \neq Y\} - {\mathbf P}\{f^*(\mathbf{X}) \neq Y\} \le c_6 \cdot n^{- \frac{1}{2\kappa^2+2} + \epsilon}$ for any $\epsilon>0$ when the network and training parameters are chosen appropriately; a truncated estimator for the posterior achieves near-minimax rates. The analysis leverages metric-entropy bounds and adapts techniques from prior work on over-parameterized networks to a convolutional architecture with parallel CNNs and a linear output layer. Overall, the results provide a theoretical explanation for the empirical success of gradient-descent trained, highly-parametrized CNNs by showing convergence rates that do not degrade with image dimension.
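To get a feel for how the exponent $-\frac{1}{2\kappa^2+2} + \epsilon$ behaves, the following minimal sketch evaluates it for a few values of $\kappa$. The function names and the default `c6 = 1.0` are illustrative assumptions; the constant $c_6$ is not specified in this summary, and `eps = 0.0` gives only the limiting exponent, since the bound is stated for $\epsilon > 0$.

```python
def rate_exponent(kappa: int, eps: float = 0.0) -> float:
    """Exponent of n in the bound: -1 / (2 * kappa^2 + 2) + eps."""
    if kappa < 1:
        raise ValueError("kappa must be a positive integer")
    return -1.0 / (2 * kappa**2 + 2) + eps


def excess_risk_bound(n: int, kappa: int, c6: float = 1.0, eps: float = 0.0) -> float:
    """Evaluate c6 * n^exponent.

    c6 = 1.0 is a placeholder for illustration, not a value from the paper.
    """
    return c6 * n ** rate_exponent(kappa, eps)


if __name__ == "__main__":
    # The exponent depends only on kappa, not on the image dimensions d_1, d_2.
    for kappa in (1, 2, 3):
        print(kappa, rate_exponent(kappa), excess_risk_bound(10_000, kappa))
```

Neither function takes $d_1$ or $d_2$ as an argument, which mirrors the dimension-free form of the bound; a larger $\kappa$ gives an exponent closer to zero and hence a slower rate.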

Abstract

Image classification based on over-parametrized convolutional neural networks with a global average-pooling layer is considered. The weights of the network are learned by gradient descent. A bound on the rate of convergence of the difference between the misclassification risk of the newly introduced convolutional neural network estimate and the minimal possible value is derived.
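As a rough illustration of what a global average-pooling layer does, the sketch below applies one filter to an image and averages the resulting feature map into a single number. It is a toy forward pass under stated assumptions, not the architecture of the paper: the helper names, the single filter, the absence of an activation function, and the omission of gradient-descent training are all simplifications introduced here.

```python
import numpy as np


def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map.

    As is usual in CNNs, this is a cross-correlation: the kernel is not flipped.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def global_average_pool(feature_map: np.ndarray) -> float:
    """Collapse the whole feature map into its mean value."""
    return float(feature_map.mean())


# Hypothetical 4x4 image with pixel values in [0, 1] and a 2x2 filter.
image = np.ones((4, 4))
kernel = np.ones((2, 2))
feature_map = conv2d_valid(image, kernel)
pooled = global_average_pool(feature_map)
```

Because the pooled value is an average over all filter positions, its size does not grow with the image, and moving a local pattern between interior positions leaves it unchanged.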

Paper Structure (14 sections, 10 theorems, 171 equations)


Introduction
Scope of this paper
Image classification
Convolutional neural networks
Main result
Discussion of related results
Notation
Outline
Definition of the estimate
Main result
Proofs
Auxiliary results
Proof of Theorem 1
Acknowledgment

Key Result

Theorem 1

Let $d_1, d_2, \kappa \in \mathbb{N}$ with $\kappa \leq \min\{d_1,d_2\}$. Let $(\mathbf{X},Y)$, $(\mathbf{X}_1,Y_1)$, …, $(\mathbf{X}_n,Y_n)$ be independent and identically distributed $[0,1]^{\{1, \dots, d_1\} \times \{1, \dots, d_2\}} \times \{0,1\}$-valued random variables. Assume that the a posteriori probability […] and $K_n \in \mathbb{N}$ such that […] and […] for some $\rho>0$ hold. Choose $L_n \in \mathbb{N}$ with […]

Theorems & Definitions (11)

Definition 1
Theorem 1
Lemma 1
Lemma 2
Lemma 3
Lemma 4
Lemma 5
Lemma 6
Lemma 7
Lemma 8
...and 1 more
