Learning of deep convolutional network image classifiers via stochastic gradient descent and over-parametrization

Michael Kohler, Adam Krzyzak, Alisha Sänger

TL;DR

This work analyzes image classification with deep CNNs trained by stochastic gradient descent in an over-parameterized regime. By formulating the estimator as a linear combination of truncated CNNs and applying SGD with a projection step, the authors derive excess-risk bounds that can be dimension-free under a hierarchical max-pooling model for the a posteriori probability. The core contribution is a general bound on the logistic risk for SGD-learned over-parameterized CNN ensembles, plus specialized rates for hierarchical models, including an improved rate under margin-type conditions. The results provide theoretical justification for dimension-independent learning performance on large-scale image datasets, connecting optimization dynamics, approximation power, and generalization through a unified framework. The work extends prior analyses to stochastic optimization and max-pooling hierarchies, with implications for understanding why gradient-based training of deep CNNs can generalize well in high-dimensional image spaces.

Abstract

Image classification from independent and identically distributed random variables is considered. Image classifiers are defined that are based on a linear combination of deep convolutional networks with a max-pooling layer, where all the weights are learned by stochastic gradient descent. A general result is presented which shows that these image classifiers are able to approximate the best possible deep convolutional network. If the a posteriori probability satisfies a suitable hierarchical composition model, the corresponding deep convolutional neural network image classifier is shown to achieve a rate of convergence that is independent of the dimension of the images.
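
The training scheme summarized above (a linear combination of truncated network outputs, fitted by stochastic gradient descent on the logistic loss with a projection step) can be illustrated with a heavily simplified sketch. It is not the authors' estimator. As assumptions made only for illustration, the network outputs are precomputed numbers rather than CNNs, only the outer weights of the linear combination are trained (in the paper all weights are learned), the projection is taken onto a Euclidean ball, and the names `truncate`, `project_onto_ball`, `projected_sgd`, and `empirical_risk`, together with all numeric values, are hypothetical.

```python
import math
import random


def truncate(z, beta):
    """Clip z to the interval [-beta, beta]."""
    return max(-beta, min(beta, z))


def logistic_loss(z):
    """Logistic loss log(1 + exp(-z)), evaluated in a numerically stable way."""
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def logistic_loss_grad(z):
    """Derivative of the logistic loss: -1 / (1 + exp(z))."""
    if z > 0:
        e = math.exp(-z)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(z))


def project_onto_ball(w, radius):
    """Euclidean projection onto {w : ||w|| <= radius} (assumed projection set)."""
    norm = math.sqrt(sum(v * v for v in w))
    if norm <= radius:
        return list(w)
    return [v * radius / norm for v in w]


def empirical_risk(w, samples, beta):
    """Average logistic loss of the linear combination over the samples."""
    total = 0.0
    for outputs, y in samples:
        score = sum(wk * truncate(o, beta) for wk, o in zip(w, outputs))
        total += logistic_loss(y * score)
    return total / len(samples)


def projected_sgd(samples, num_nets, beta, radius, step, epochs, seed=0):
    """SGD on the outer weights only, projecting after every step.

    Each sample is (outputs, y): `outputs` holds stand-in network outputs
    (plain floats here, not real CNNs) and y is a label in {-1, +1}.
    """
    rng = random.Random(seed)
    w = [0.0] * num_nets
    for _ in range(epochs):
        order = list(range(len(samples)))
        rng.shuffle(order)
        for i in order:
            outputs, y = samples[i]
            t = [truncate(o, beta) for o in outputs]
            margin = y * sum(wk * tk for wk, tk in zip(w, t))
            g = logistic_loss_grad(margin)
            # Gradient of the loss with respect to w_k is g * y * t_k.
            w = [wk - step * g * y * tk for wk, tk in zip(w, t)]
            w = project_onto_ball(w, radius)
    return w


# Toy data: the first stand-in output carries the sign of the label.
toy_samples = [([2.0, 0.3], 1), ([-1.5, 0.3], -1), ([1.0, -0.2], 1), ([-3.0, -0.1], -1)]
weights = projected_sgd(toy_samples, num_nets=2, beta=1.0, radius=5.0, step=0.5, epochs=50)
```

The projection keeps the weight vector in a bounded set after every update, and the truncation bounds each summand of the linear combination. Both are standard devices for controlling the complexity of an estimator, which is the role the summary attributes to them.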

Paper Structure (28 sections, 26 theorems, 437 equations)

Key Result

Theorem 1

Let $(X,Y)$, $(X_1,Y_1), \ldots, (X_n,Y_n)$ be independent and identically distributed random variables with values in $[0,1]^{d_1 \times d_2} \times \{-1,1\}$. Let $N_n,I_n, t_n \in \mathbb{N}$ and let $C_n,D_n \geq 0$. Set $\beta_n = c_3 \cdot \log n$, and define the estimate $f_n$ as above. Assume that there exists $\tilde{\vartheta}\in \mathbf{\Theta}^0$, such that $f_{\tilde{\vartheta}}(X)=0

Theorems & Definitions (28)

  • Theorem 1
  • Definition 1
  • Definition 2
  • Theorem 2
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • ...and 18 more