Rethinking generalization of classifiers in separable classes scenarios and over-parameterized regimes

Julius Martinetz, Christoph Linse, Thomas Martinetz

TL;DR

It is shown that in separable classes scenarios the proportion of "bad" global minima diminishes exponentially with the number of training data n, which may shed light on the unexpectedly good generalization of over-parameterized Neural Networks.

Abstract

We investigate the learning dynamics of classifiers in scenarios where classes are separable or classifiers are over-parameterized. In both cases, Empirical Risk Minimization (ERM) results in zero training error. However, there are many global minima with a training error of zero, some of which generalize well and some of which do not. We show that in separable classes scenarios the proportion of "bad" global minima diminishes exponentially with the number of training data n. Our analysis provides bounds and learning curves dependent solely on the density distribution of the true error for the given classifier function set, irrespective of the set's size or complexity (e.g., number of parameters). This observation may shed light on the unexpectedly good generalization of over-parameterized Neural Networks. For the over-parameterized scenario, we propose a model for the density distribution of the true error, yielding learning curves that align with experiments on MNIST and CIFAR-10.
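The exponential decay claimed above can be illustrated with a minimal numerical sketch. It relies on one standard fact and one loud assumption. The fact: a classifier with true error $E$ classifies $n$ i.i.d. training samples without error with probability $(1-E)^n$. The assumption: the density of true errors over the function set is uniform on $[0,1]$, chosen here purely for illustration and not the model proposed in the paper. The function name `bad_fraction` and its parameters are hypothetical.

```python
def bad_fraction(n, eps, density=None, grid=100_000):
    """Fraction of zero-training-error classifiers whose true error exceeds eps.

    Each classifier with true error e survives n i.i.d. samples with
    probability (1 - e)**n, so the surviving mass at e is density(e) * (1 - e)**n.
    """
    if density is None:
        # Assumption for illustration only: uniform density of true errors.
        density = lambda e: 1.0
    total = 0.0
    bad = 0.0
    for k in range(grid):
        e = (k + 0.5) / grid              # midpoint of the k-th bin on [0, 1]
        w = density(e) * (1.0 - e) ** n   # mass of classifiers consistent with n samples
        total += w
        if e > eps:
            bad += w
    return bad / total


if __name__ == "__main__":
    for n in (0, 10, 20, 50, 100):
        print(n, bad_fraction(n, eps=0.1))
```

Under the uniform-density assumption the fraction has the closed form $(1-\varepsilon)^{n+1}$, so the printed values shrink geometrically in $n$. Passing a different `density` callable changes the constants but keeps the $(1-E)^n$ survival factor that drives the decay.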

Paper Structure

This paper contains 7 sections, 26 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Classification problem in 2D with two (linearly) separable classes. The red and blue dots show 100000 random data points of both classes (A). The same data points classified with a polynomial classifier of degree 10 after training with 2 (B) and 20 (C) training samples.
  • Figure 2: Distribution of the test errors (red crosses) for different numbers of training data on the classification problem shown in Figure 1, for a linear classifier (A) and a polynomial classifier of degree 10 (C). The green line in A shows the 25% bound given by inequality (\ref{eq:epsilonbound}) with $R=98$; the blue lines in A and C show the same bound with $R$ chosen such that it becomes tight. At the bottom, the red lines show, on a logarithmic scale, the fractions of test errors exceeding $0.1$ and $0.05$, respectively, for the linear classifier (B) and for the polynomial classifier of degree 10 (D). The green lines in B show the bound (\ref{eq:ourboundexp}) with $R=98$; the blue lines in B and D show it with $R$ chosen such that the bound becomes tight, demonstrating the exponential decrease of the fraction of solutions with test errors exceeding a given $\varepsilon$.
  • Figure 3: $Q_n(E)$ with $E_{\textrm{min}} = 0.1$, $\alpha=57$ and $\beta=8$. For $n=0$ we obtain $D(E)$ with its maximum at $E=0.9$. For large $n$ we had to cut the tips of the curves.
  • Figure 4: Mean test errors and their standard deviations of ResNet018 (top) and ResNet101 (bottom) on MNIST for different numbers of training samples $n$. The green crosses show their double logarithmic values as deviation from $E_{\textrm{min}}$. The red curves show fits of Eq. \ref{eq:modelerror} (parameters given in Table \ref{tab:modelerror_parameters}).
  • Figure 5: Mean test errors and their standard deviations of ResNet018 and MLP3 (top) and ResNet101 and MLP8 (bottom) on CIFAR-10 for different numbers of training samples $n$. The red curves show fits of Eq. \ref{eq:modelerror} (parameters given in Table \ref{tab:modelerror_parameters}).