Analysis of the rate of convergence of an over-parametrized convolutional neural network image classifier learned by gradient descent

Michael Kohler, Adam Krzyzak, Benjamin Walter

TL;DR

This work analyzes the rate of convergence of an over-parametrized convolutional neural network image classifier trained by gradient descent in a binary classification setting. It proves a dimension-free bound on the excess misclassification risk under an average-pooling model for the a posteriori probability with parameter $\kappa$: for any $\epsilon>0$, ${\mathbf P}\{f_n(\mathbf{X}) \neq Y\} - {\mathbf P}\{f^*(\mathbf{X}) \neq Y\} \le c_6 \cdot n^{- \frac{1}{2\kappa^2+2} + \epsilon}$ when the network and training parameters are chosen appropriately. A truncated estimator for the posterior achieves near-minimax rates. The analysis relies on metric-entropy bounds and adapts techniques from prior work on over-parametrized networks to a convolutional architecture with parallel CNNs and a linear output layer. Overall, the results offer a theoretical explanation for the empirical success of heavily over-parametrized CNNs trained by gradient descent, since the exponent of the convergence rate depends on $\kappa$ but not on the image dimensions.
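To get a feel for the bound, the exponent $\frac{1}{2\kappa^2+2}$ can be evaluated for small $\kappa$. The following is a minimal sketch, not code from the paper: the helper names `rate_exponent` and `rate` are hypothetical, the constant $c_6$ is dropped because its value is not given here, and $\epsilon$ defaults to 0 for illustration although the bound is stated for $\epsilon>0$.

```python
from fractions import Fraction


def rate_exponent(kappa: int) -> Fraction:
    """Return 1 / (2*kappa^2 + 2), the exponent in n^(-1/(2*kappa^2+2) + eps)."""
    if kappa < 1:
        raise ValueError("kappa must be a positive integer")
    return Fraction(1, 2 * kappa**2 + 2)


def rate(n: int, kappa: int, eps: float = 0.0) -> float:
    """Evaluate n^(-1/(2*kappa^2+2) + eps); the constant c_6 is omitted."""
    return float(n) ** (-float(rate_exponent(kappa)) + eps)


if __name__ == "__main__":
    # The exponent shrinks quickly as kappa grows; the image size never appears.
    for kappa in (1, 2, 3):
        print(kappa, rate_exponent(kappa), rate(10**6, kappa))
```

The image dimensions $d_1$ and $d_2$ do not appear in the calculation, which is the sense in which the rate is dimension-free. Larger $\kappa$ makes the guaranteed rate slower.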

Abstract

Image classification based on over-parametrized convolutional neural networks with a global average-pooling layer is considered. The weights of the network are learned by gradient descent. A bound on the rate of convergence of the difference between the misclassification risk of the newly introduced convolutional neural network estimate and the minimal possible value is derived.
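The role of the global average-pooling layer can be illustrated with a small NumPy sketch. It is not the estimate defined in the paper: the single convolutional layer, the ReLU activation, the fixed filters, the 0.5 plug-in threshold and the names `conv2d_valid` and `gap_classifier` are all assumptions made for illustration, and no gradient descent training is shown.

```python
import numpy as np


def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """'Valid' 2-D cross-correlation of one image with one filter."""
    k1, k2 = kernel.shape
    d1, d2 = image.shape
    out = np.empty((d1 - k1 + 1, d2 - k2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k1, j:j + k2] * kernel)
    return out


def gap_classifier(image, kernels, out_weights):
    """Convolve, apply ReLU, average each feature map, combine linearly, threshold."""
    # Global average pooling: one number per filter, whatever the image size.
    pooled = np.array(
        [np.maximum(conv2d_valid(image, k), 0.0).mean() for k in kernels]
    )
    score = float(pooled @ np.asarray(out_weights))  # linear output layer
    return int(score >= 0.5), score  # assumed plug-in rule: predict 1 if score >= 1/2


if __name__ == "__main__":
    image = np.ones((4, 4))            # toy image with values in [0, 1]
    kernels = [np.ones((2, 2))]        # one hypothetical 2x2 filter
    print(gap_classifier(image, kernels, [0.25]))
```

Because pooling averages over all positions of a feature map, the number of inputs to the linear output layer equals the number of filters and does not grow with the image dimensions.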

Paper Structure (14 sections, 10 theorems, 171 equations)


Key Result

Theorem 1

Let $d_1, d_2, \kappa \in \mathbb{N}$ with $\kappa \leq \min\{d_1,d_2\}$. Let $(\mathbf{X},Y)$, $(\mathbf{X}_1,Y_1)$, …, $(\mathbf{X}_n,Y_n)$ be independent and identically distributed $[0,1]^{\{1, \dots, d_1\} \times \{1, \dots, d_2\}} \times \{0,1\}$-valued random variables. Assume that the a posteriori probability […] and $K_n \in \mathbb{N}$ such that […] and […] for some $\rho>0$ hold. Choose $L_n \in \mathbb{N}$ with […]

Theorems & Definitions (11)

  • Definition 1
  • Theorem 1
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • Lemma 7
  • Lemma 8
  • ...and 1 more