From Tempered to Benign Overfitting in ReLU Neural Networks

Guy Kornowski; Gilad Yehudai; Ohad Shamir

TL;DR

The input dimension is shown to play a crucial role in the type of overfitting exhibited by 2-layer ReLU networks trained on noisy classification data, a finding that is also validated empirically for intermediate dimensions. The results shed light on the intricate connections between the dimension, sample size, architecture and training algorithm on the one hand, and the type of resulting overfitting on the other.

Abstract

Overparameterized neural networks (NNs) are observed to generalize well even when trained to perfectly fit noisy data. This phenomenon motivated a large body of work on "benign overfitting", where interpolating predictors achieve near-optimal performance. Recently, it was conjectured and empirically observed that the behavior of NNs is often better described as "tempered overfitting", where the performance is non-optimal yet also non-trivial, and degrades as a function of the noise level. However, a theoretical justification of this claim for non-linear NNs has been lacking so far. In this work, we provide several results that aim at bridging these complementary views. We study a simple classification setting with 2-layer ReLU NNs, and prove that under various assumptions, the type of overfitting transitions from tempered in the extreme case of one-dimensional data, to benign in high dimensions. Thus, we show that the input dimension plays a crucial role in the type of overfitting in this setting, which we also validate empirically for intermediate dimensions. Overall, our results shed light on the intricate connections between the dimension, sample size, architecture and training algorithm on the one hand, and the type of resulting overfitting on the other hand.

Paper Structure (29 sections, 26 theorems, 88 equations, 3 figures)

Introduction
Our contributions:
Related work
Preliminaries
Notation.
Setting.
Implicit bias.
Tempered overfitting in one dimension
Benign overfitting in high dimensions
Benign overfitting under KKT assumptions
The role of the bias and catastrophic overfitting in neural networks
Between tempered and benign overfitting for intermediate dimensions
Discussion
Proofs of tempered overfitting
Proof of Theorem \ref{thm: one dim kkt tempered}
...and 14 more sections

Key Result

Theorem 2.1

Under the setting above, if there exists some $t_0$ such that ${{\boldsymbol{\theta}}}(t_0)$ satisfies $\min_{i\in [m]} y_i N_{{{\boldsymbol{\theta}}}(t_0)}(\mathbf{x}_i)>0$, then $\frac{{{\boldsymbol{\theta}}}(t)}{\|{{\boldsymbol{\theta}}}(t)\|}\overset{t\to\infty}{\longrightarrow}\frac{{{\boldsymb

Figures (3)

Figure 1: Training a $2$-layer network with $1000$ neurons on $m$ samples drawn uniformly from $\mathbb{S}^{d-1}$ for varying input dimensions $d$. Each label is equal to $-1$ with probability $p$ and $+1$ with probability $(1-p)$. Left: $m=500$, Right: $m=2000$. The line corresponding to the identity function $y=x$ was added for reference. Best viewed in color.
Figure 2: Left: $2$-layer network with $n=50$ and $n=3000$ neurons and varying input dimension. Right: $3$-layer network with $n=1000$ neurons and varying input dimension. Both plots correspond to $m=500$ samples.
Figure 3: Illustration of the proof of Theorem \ref{thm: one dim local tempered} in case there is a single non-linearity along $[x_i,x_{i+1}]$. If the network is not linear along $[x_{i},x_{i+1}]$, one of the cases illustrated in the top row (in blue) must occur. In each case, the dashed green perturbation classifies correctly by altering exactly two neurons while reducing the parameter norm. Moreover, if the network is linear along $[x_{i},x_{i+1}]$, yet $N_{\boldsymbol{\theta}}(x_i)<-1$ or $N_{\boldsymbol{\theta}}(x_{i+1})>1$, one of the cases illustrated in the bottom row must occur. In either case, the dashed green perturbation classifies correctly by altering exactly two neurons while reducing the parameter norm.
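
The data-generating process described in the Figure 1 caption (inputs uniform on $\mathbb{S}^{d-1}$, each label equal to $-1$ with probability $p$) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' code: the function names `sample_sphere`, `sample_labels` and `make_dataset` are hypothetical, drawing uniform points by normalizing Gaussian vectors is an assumption (the caption does not say how samples were drawn), and the network training itself is omitted.

```python
import math
import random


def sample_sphere(m, d, rng):
    """Draw m points uniformly from the unit sphere S^{d-1}.

    Assumption: uniform sampling is done by normalizing standard
    Gaussian vectors; the caption does not specify the method.
    """
    points = []
    for _ in range(m):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g))
        points.append([v / norm for v in g])
    return points


def sample_labels(m, p, rng):
    """Each label is -1 with probability p and +1 with probability 1 - p."""
    return [-1 if rng.random() < p else 1 for _ in range(m)]


def make_dataset(m, d, p, seed=0):
    """Hypothetical helper bundling inputs and labels for one (m, d, p) setting."""
    rng = random.Random(seed)
    return sample_sphere(m, d, rng), sample_labels(m, p, rng)


if __name__ == "__main__":
    # m=500 matches the left panel of Figure 1; d and p are arbitrary here.
    X, y = make_dataset(m=500, d=10, p=0.2)
    frac_negative = sum(1 for label in y if label == -1) / len(y)
    print(f"{len(X)} samples, fraction of -1 labels: {frac_negative:.3f}")
```

Sweeping `d` and `p` over a grid and training a network to interpolate each dataset would be the natural next step for reproducing plots of this kind, but those details are not given in the captions above.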

Theorems & Definitions (43)

Theorem 2.1: Rephrased from lyu2020gradient, ji2020directional
Theorem 3.1
Theorem 3.2
Theorem 4.1
Remark 4.2: Data distribution
Theorem 4.3
Proposition 4.4: Catastrophic overfitting without bias
Proposition 4.5: Not benign without bias
Proposition 4.6: Benign overfitting does not follow from KKT without bias
Lemma A.1
...and 33 more
