From Tempered to Benign Overfitting in ReLU Neural Networks

Guy Kornowski, Gilad Yehudai, Ohad Shamir

TL;DR

The input dimension is shown to play a crucial role in the type of overfitting exhibited by 2-layer ReLU neural networks in a simple classification setting: overfitting is tempered for one-dimensional data and becomes benign in high dimensions, a transition that is also validated empirically for intermediate dimensions. The results shed light on the intricate connections between the dimension, sample size, architecture and training algorithm on the one hand, and the type of resulting overfitting on the other hand.

Abstract

Overparameterized neural networks (NNs) are observed to generalize well even when trained to perfectly fit noisy data. This phenomenon motivated a large body of work on "benign overfitting", where interpolating predictors achieve near-optimal performance. Recently, it was conjectured and empirically observed that the behavior of NNs is often better described as "tempered overfitting", where the performance is non-optimal yet also non-trivial, and degrades as a function of the noise level. However, a theoretical justification of this claim for non-linear NNs has been lacking so far. In this work, we provide several results that aim at bridging these complementing views. We study a simple classification setting with 2-layer ReLU NNs, and prove that under various assumptions, the type of overfitting transitions from tempered in the extreme case of one-dimensional data, to benign in high dimensions. Thus, we show that the input dimension has a crucial role on the type of overfitting in this setting, which we also validate empirically for intermediate dimensions. Overall, our results shed light on the intricate connections between the dimension, sample size, architecture and training algorithm on the one hand, and the type of resulting overfitting on the other hand.
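
The regimes named in the abstract can be summarized informally as follows. This is only a sketch paraphrasing the abstract's wording, not the paper's formal definitions. The symbols $p$ (label noise level), $\mathcal{E}(p)$ (excess test error of the interpolating predictor over the optimal one, in the large-sample limit) and $\mathcal{E}_{\mathrm{triv}}$ (excess error of a trivial predictor) are notation assumed here for illustration. The third line uses the term "catastrophic", which appears in the title of Proposition 4.4.

$$
\begin{aligned}
\text{benign:} \quad & \mathcal{E}(p) \to 0, \\
\text{tempered:} \quad & 0 < \mathcal{E}(p) < \mathcal{E}_{\mathrm{triv}}, \ \text{with } \mathcal{E}(p) \text{ growing with } p, \\
\text{catastrophic:} \quad & \mathcal{E}(p) \approx \mathcal{E}_{\mathrm{triv}}.
\end{aligned}
$$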

Paper Structure (29 sections, 26 theorems, 88 equations, 3 figures)


Key Result

Theorem 2.1

Under the setting above, if there exists some $t_0$ such that ${{\boldsymbol{\theta}}}(t_0)$ satisfies $\min_{i\in [m]} y_i N_{{{\boldsymbol{\theta}}}(t_0)}(\mathbf{x}_i)>0$, then $\frac{{{\boldsymbol{\theta}}}(t)}{\|{{\boldsymbol{\theta}}}(t)\|}\overset{t\to\infty}{\longrightarrow}\frac{{{\boldsymb

Figures (3)

  • Figure 1: Training a $2$-layer network with $1000$ neurons on $m$ samples drawn uniformly from $\mathbb{S}^{d-1}$ for varying input dimensions $d$. Each label is equal to $-1$ with probability $p$ and $+1$ with probability $(1-p)$. Left: $m=500$, Right: $m=2000$. The line corresponding to the identity function $y=x$ was added for reference. Best viewed in color.
  • Figure 2: Left: $2$-layer network with $n=50$ and $n=3000$ neurons and varying input dimension. Right: $3$-layer network with $n=1000$ neurons and varying input dimension. Both plots correspond to $m=500$ samples.
  • Figure 3: Illustration of the proof of Theorem \ref{thm: one dim local tempered} in case there is a single non-linearity along $[x_i,x_{i+1}]$. If the network is not linear along $[x_{i},x_{i+1}]$, one of the cases illustrated in the top row (in blue) must occur. In each case, the dashed green perturbation classifies correctly by altering exactly two neurons while reducing the parameter norm. Moreover, if the network is linear along $[x_{i},x_{i+1}]$, yet $N_{{\boldsymbol{\theta}}}(x_i)<-1$ or $N_{{{\boldsymbol{\theta}}}}(x_{i+1})>1$, one of the cases illustrated in the bottom row must occur. In either case, the dashed green perturbation classifies correctly by altering exactly two neurons while reducing the parameter norm.
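
The data distribution described in the caption of Figure 1 (inputs uniform on $\mathbb{S}^{d-1}$, each label equal to $-1$ with probability $p$ and $+1$ otherwise) can be sketched as below. This is a minimal illustration using only the Python standard library, not the authors' experimental code. The function names `sample_sphere` and `sample_labels`, the seed, and the values $d=10$ and $p=0.2$ are assumptions chosen for the example; $m=500$ matches the left panel of Figure 1. Network training is not shown.

```python
import math
import random


def sample_sphere(m, d, rng):
    """Draw m points uniformly from the unit sphere S^{d-1}.

    A standard Gaussian vector is rotationally invariant, so dividing it
    by its Euclidean norm gives a uniformly distributed direction.
    """
    points = []
    for _ in range(m):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g))
        points.append([v / norm for v in g])
    return points


def sample_labels(m, p, rng):
    """Draw m labels: -1 with probability p, +1 with probability 1 - p."""
    return [-1 if rng.random() < p else 1 for _ in range(m)]


# Hypothetical settings for illustration only.
rng = random.Random(0)
m, d, p = 500, 10, 0.2
X = sample_sphere(m, d, rng)
y = sample_labels(m, p, rng)

# Empirical fraction of -1 labels, which should be close to p.
flip_fraction = sum(1 for label in y if label == -1) / m
```

Because the labels are drawn independently of the inputs, the sign predictor that always outputs the majority label ($+1$ when $p < 1/2$) has test error $p$ under this distribution.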

Theorems & Definitions (43)

  • Theorem 2.1: Rephrased from lyu2020gradient, ji2020directional
  • Theorem 3.1
  • Theorem 3.2
  • Theorem 4.1
  • Remark 4.2: Data distribution
  • Theorem 4.3
  • Proposition 4.4: Catastrophic overfitting without bias
  • Proposition 4.5: Not benign without bias
  • Proposition 4.6: Benign overfitting does not follow from KKT without bias
  • Lemma A.1
  • ...and 33 more