Provable Tempered Overfitting of Minimal Nets and Typical Nets

Itamar Harel, William M. Hoza, Gal Vardi, Itay Evron, Nathan Srebro, Daniel Soudry

TL;DR

These are the first theoretical results on benign or tempered overfitting that apply to deep NNs, and do not require a very high or very low input dimension.

Abstract

We study the overfitting behavior of fully connected deep Neural Networks (NNs) with binary weights fitted to perfectly classify a noisy training set. We consider interpolation using both the smallest NN (having the minimal number of weights) and a random interpolating NN. For both learning rules, we prove overfitting is tempered. Our analysis rests on a new bound on the size of a threshold circuit consistent with a partial function. To the best of our knowledge, ours are the first theoretical results on benign or tempered overfitting that: (1) apply to deep NNs, and (2) do not require a very high or very low input dimension.

Paper Structure

This paper contains 50 sections, 44 theorems, 222 equations, 4 figures, 1 algorithm.

Key Result

Theorem 3.1

Let $f \colon \{0, 1\}^{d_0} \to \{0, 1, \star\}$ be any function. When $f(\mathbf{x}) = \star$, the interpretation is that $f$ is "undefined" on $\mathbf{x}$, i.e., $f$ is a "partial" function. Let $N = |f^{-1}(\{0, 1\})|$ and $N_1 = |f^{-1}(1)|$. There exists a depth-$14$ binary threshold network $

Figures (4)

  • Figure 1: Types of overfitting behaviors. Consider a binary classification problem of learning a realizable distribution $\mathcal{D}_0$. Let $\mathcal{D}$ be the distribution induced by adding an $\varepsilon^{\star}$-probability for a data point's label to be flipped relative to $\mathcal{D}_0$. Suppose a model is trained with data from $\mathcal{D}$. Then, assuming the classes are balanced, the trivial generalization performance is $0.5$ (in gray; e.g., with a constant predictor). Left. Evaluating the model on $\mathcal{D}$, a Bayes-optimal hypothesis (in red) obtains a generalization error of $\varepsilon^{\star}$. For large enough training sets, our results (Section \ref{sec:tempered}) dictate a tempered overfitting behavior illustrated above. For arbitrary noise, the error is approximately bounded by ${1\!-\! {\varepsilon^{\star}}^{\varepsilon^{\star}} \left(1- \varepsilon^{\star}\right)^{1-\varepsilon^{\star}}}$ (blue). For independent noise, the error is concentrated around the tighter ${2 \varepsilon^{\star} \left(1- \varepsilon^{\star}\right)}$ (yellow). A similar figure was previously shown in manoj2023interpolation for shortest-program interpolators. Right. Assuming independent noise, the left figure can be transformed into the error of the model on $\mathcal{D}_0$ (see Lemma \ref{app-lem: indep noisy clean relation}). The linear behavior in the independent setting (yellow) is similar to the behavior observed empirically in mallinar2022benign.
  • Figure 2: Interpolating a dataset. To memorize the training set, we use a subset of the parameters to match those of the teacher and another subset to memorize the noise (label flips). Then, we "merge" these subsets to interpolate the noisy training set. In our figure, (1) blue edges represent weights identical to the teacher's; (2) yellow edges memorize the noise; (3) red edges are set to 0; and two additional layers implement the XOR between outputs, thus memorizing the training set.
  • Figure 3: Interpolating a dataset with an overparameterized student. We build on the construction from Figure \ref{fig:network} that memorizes a dataset using a subset of the parameters (blue, yellow, and red edges). Then, redundant neurons (gray) can be effectively ignored by setting their neuron scaling parameters ($\boldsymbol{\gamma}$) to 0, leaving the redundant weights (gray edges) unconstrained. Thus, the interpolation probability $p_{S}$ can be bounded by a quantity exponentially decaying in the number of neurons $n\left(\underline{d}\right)$ rather than in the number of weights $w \left(\underline{d}\right) = \omega \left(N\right)$.
  • Figure 4: Implementing a narrow network with a wider network. Blue edges represent parameters set to equal the parameters of $\bar{h}$, gray nodes represent zero neuron scaling, and gray edges represent unconstrained parameters.
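
The two error curves named in the Figure 1 caption can be evaluated numerically. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and `noisy_to_clean` assumes the standard identity for independent label flips, $\text{err}_{\mathcal{D}} = \varepsilon^{\star} + (1 - 2\varepsilon^{\star})\,\text{err}_{\mathcal{D}_0}$, which may differ in form from the lemma the caption cites.

```python
def arbitrary_noise_bound(eps: float) -> float:
    """Approximate error bound for arbitrary noise: 1 - eps^eps * (1 - eps)^(1 - eps)."""
    return 1.0 - (eps ** eps) * ((1.0 - eps) ** (1.0 - eps))


def independent_noise_error(eps: float) -> float:
    """Error level for independent noise: 2 * eps * (1 - eps)."""
    return 2.0 * eps * (1.0 - eps)


def noisy_to_clean(noisy_err: float, eps: float) -> float:
    """Convert error on the noisy distribution D to error on the clean D_0.

    Assumption (not taken from the paper's text): labels are flipped
    independently with probability eps < 0.5, so that
    noisy_err = eps + (1 - 2 * eps) * clean_err.
    """
    return (noisy_err - eps) / (1.0 - 2.0 * eps)


if __name__ == "__main__":
    for eps in (0.05, 0.1, 0.2, 0.3, 0.4):
        noisy = independent_noise_error(eps)
        print(
            f"eps={eps:.2f}  "
            f"arbitrary bound={arbitrary_noise_bound(eps):.4f}  "
            f"independent={noisy:.4f}  "
            f"clean error={noisy_to_clean(noisy, eps):.4f}"
        )
```

Under the stated assumption, plugging $2\varepsilon^{\star}(1-\varepsilon^{\star})$ into the conversion gives a clean error of exactly $\varepsilon^{\star}$, which is consistent with the linear behavior the caption describes for the right panel.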

Theorems & Definitions (106)

  • Definition 2.1: Binary threshold networks
  • Remark 2.2
  • Remark 2.3: Simple counting argument
  • Definition 2.5: Consistent datasets
  • Theorem 3.1: Memorizing the label flips
  • Remark 3.2: Dependence on $d_0$
  • Lemma 3.3: XOR of two NNs
  • Corollary 3.4: Memorizing a consistent dataset
  • Definition 4.1: Peak marginal probability
  • Theorem 4.2: Tempered overfitting of min-size NN interpolators
  • ...and 96 more
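
The idea behind Lemma 3.3 and Corollary 3.4, as described in the Figure 2 caption, is that a student interpolates the noisy training set by XOR-ing the teacher's output with a network that memorizes the label flips, using two additional layers for the XOR. The sketch below illustrates this on a toy example. Everything in it is an assumption made for illustration: the weight set $\{-1, 0, 1\}$, the integer thresholds, the toy majority teacher, and the set lookup standing in for the flip-memorizing network of Theorem 3.1. It is not the paper's construction.

```python
from itertools import product


def threshold_unit(weights, threshold, inputs):
    """Fire (return 1) when the weighted sum of the inputs reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)


def xor_two_layers(a, b):
    """XOR of two bits built from two layers of threshold units."""
    h_or = threshold_unit([1, 1], 1, [a, b])    # first layer: OR
    h_and = threshold_unit([1, 1], 2, [a, b])   # first layer: AND
    return threshold_unit([1, -1], 1, [h_or, h_and])  # second layer: OR and not AND


def teacher(x):
    """Toy teacher (hypothetical): majority vote over three input bits."""
    return threshold_unit([1, 1, 1], 2, x)


# Hypothetical training points whose labels were flipped by noise.
FLIPPED = {(0, 0, 1), (1, 1, 0)}


def noise_memorizer(x):
    """Stand-in for a network that outputs 1 exactly on the flipped points."""
    return int(tuple(x) in FLIPPED)


def student(x):
    """Interpolating student: teacher output XOR memorized label flips."""
    return xor_two_layers(teacher(x), noise_memorizer(x))


# Noisy labels for every point of the toy domain {0, 1}^3.
noisy_labels = {x: teacher(x) ^ int(x in FLIPPED) for x in product((0, 1), repeat=3)}

if __name__ == "__main__":
    for x, y in noisy_labels.items():
        print(x, "noisy label:", y, "student:", student(x))
```

The student agrees with the teacher wherever no flip was memorized and disagrees exactly on the flipped points, so it fits the noisy labels everywhere in this toy domain.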