Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting

Neil Mallinar, James B. Simon, Amirhesam Abedsoltan, Parthe Pandit, Mikhail Belkin, Preetum Nakkiran

TL;DR

The paper addresses why interpolating models, including neural networks, can generalize despite fitting noise. It introduces tempered overfitting as an intermediate regime between benign and catastrophic and develops a kernel-regression trichotomy showing how ridge regularization and spectral decay govern regime type, with powerlaw spectra yielding tempered behavior and fast decay risking catastrophe. Empirically, interpolating DNNs exhibit tempered overfitting while early-stopped networks tend toward benign fitting, and kernel experiments with Laplace/NTK spectra corroborate the theory. The work refines the understanding of generalization in modern interpolators and provides a principled framework to analyze and control overfitting via spectral properties and training procedures.

Abstract

The practical success of overparameterized neural networks has motivated the recent scientific study of interpolating methods, which perfectly fit their training data. Certain interpolating methods, including neural networks, can fit noisy training data without catastrophically bad test performance, in defiance of standard intuitions from statistical learning theory. Aiming to explain this, a body of recent work has studied benign overfitting, a phenomenon where some interpolating methods approach Bayes optimality, even in the presence of noise. In this work we argue that while benign overfitting has been instructive and fruitful to study, many real interpolating methods like neural networks do not fit benignly: modest noise in the training set causes nonzero (but non-infinite) excess risk at test time, implying these models are neither benign nor catastrophic but rather fall in an intermediate regime. We call this intermediate regime tempered overfitting, and we initiate its systematic study. We first explore this phenomenon in the context of kernel (ridge) regression (KR) by obtaining conditions on the ridge parameter and kernel eigenspectrum under which KR exhibits each of the three behaviors. We find that kernels with powerlaw spectra, including Laplace kernels and ReLU neural tangent kernels, exhibit tempered overfitting. We then empirically study deep neural networks through the lens of our taxonomy, and find that those trained to interpolation are tempered, while those stopped early are benign. We hope our work leads to a more refined understanding of overfitting in modern learning.
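
As a compact restatement of the taxonomy described above, the three regimes can be written in terms of the limiting test risk. This is a sketch for a regression setting with squared loss; the symbols $R_n$ (expected test risk of the method trained on $n$ samples) and $R^*$ (Bayes-optimal risk) are notation assumed here, not taken from the text:

$$
\text{benign: } \lim_{n \to \infty} R_n = R^*, \qquad
\text{tempered: } R^* < \lim_{n \to \infty} R_n < \infty, \qquad
\text{catastrophic: } \lim_{n \to \infty} R_n = \infty .
$$

The tempered case matches the abstract's "nonzero (but non-infinite) excess risk", since the excess risk is $R_n - R^*$.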

Paper Structure

This paper contains 37 sections, 1 theorem, 23 equations, 11 figures, and 2 tables.

Key Result

Theorem 3.1

For $\{ \lambda_i \}_{i=1}^\infty$ and $\{v_i\}_{i=1}^\infty$ satisfying Assumption \ref{assumption:vals_and_coeffs}, $\sigma^2 > 0$, and $\mathcal{E}_n$ given by Eq. \ref{eqn:eigenlearning_repr},

Figures (11)

  • Figure 1: As $n \rightarrow \infty$, interpolating methods can exhibit three types of overfitting. (A) In benign overfitting, the predictor asymptotically approaches the ground-truth, Bayes-optimal function. Nadaraya-Watson kernel smoothing with a singular kernel, shown here, is asymptotically benign. (B) In tempered overfitting, the regime studied in this work, the predictor approaches a constant test risk greater than the Bayes-optimal risk. Piecewise-linear interpolation is asymptotically tempered. (C) In catastrophic overfitting, the predictor generalizes arbitrarily poorly. Rank-$n$ polynomial interpolation is asymptotically catastrophic.
  • Figure 2: DNNs trained on image data exhibit tempered overfitting, not benign overfitting. Curves show test classification error vs. training label flip probability for a Wide ResNet $28 \times 10$ trained to interpolation on binary CIFAR-10 (animals vs. vehicles) for different training sizes $n$. These curves are noise profiles, as discussed in Section \ref{sec:prelims}.
  • Figure 3: Examples of Benign and Tempered Fitting. Noise profiles for several different methods on the Binary-MNIST classification task, showing clean test error as a function of training label noise as the training set size $n$ grows. Left: two methods which exhibit benign overfitting, with performance converging to Bayes optimal as $n\to \infty$. Right: two methods which exhibit tempered overfitting, with test error that remains bounded away from $0$. The two "tempered" methods here are interpolating, while the two "benign" methods are not. Both the benign and the tempered MLP use identical architectures; one is trained for one epoch, and the other is trained to interpolation. Details in Appendix \ref{apdx:experimental_details}.
  • Figure 4: Kernel regression can exhibit all three fitting regimes with proper choice of ridge parameter and kernel. Plots show learning curves for KR with data $\{x_i\}$ sampled uniformly from the unit sphere $\mathcal{S}^{d-1}$, trained with pure noise target labels $y_i \sim \mathcal{N}(0,1)$. Test MSE is computed with respect to a clean test set. (a) KR with a Gaussian kernel and nonzero ridge is asymptotically benign. A ridge value of $\delta = 0.1$ was used. (b) Ridgeless KR with a Laplace kernel exhibits tempered overfitting. (c) Ridgeless KR with a Gaussian kernel exhibits catastrophic overfitting.
  • Figure 5: As $n$ grows, the MSE of KR with Gaussian eigenfunctions and powerlaw kernel eigenspectra with exponent $\alpha$ approaches $\alpha \sigma^2$. (a-c): learning curves with different $\alpha$. (d): test MSE at $n = 1024$ for varying $\alpha$, with the identity function shown by the solid line.
  • ...and 6 more figures
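
The noise profiles in Figures 2 and 3 plot clean test error against the probability of flipping a training label. A minimal sketch of how such a curve could be computed is given below. It is not the paper's experimental setup: the synthetic Gaussian data, the linear ground-truth boundary, the 1-nearest-neighbor classifier (used here only as a simple method that fits its training labels exactly), and the names `one_nn_predict` and `noise_profile` are all assumptions made for illustration.

```python
import numpy as np


def one_nn_predict(x_train, y_train, x_test):
    """Predict each test label as the label of the nearest training point."""
    # Pairwise squared Euclidean distances, shape (n_test, n_train).
    sq = (
        np.sum(x_test**2, axis=1)[:, None]
        + np.sum(x_train**2, axis=1)[None, :]
        - 2.0 * x_test @ x_train.T
    )
    return y_train[np.argmin(sq, axis=1)]


def noise_profile(flip_probs, n_train=2000, n_test=1000, d=2, seed=0):
    """Return clean test error for each training label flip probability."""
    rng = np.random.default_rng(seed)
    x_train = rng.standard_normal((n_train, d))
    x_test = rng.standard_normal((n_test, d))
    # Hypothetical ground truth: the sign of the first coordinate.
    y_train_clean = (x_train[:, 0] > 0).astype(int)
    y_test = (x_test[:, 0] > 0).astype(int)

    errors = []
    for p in flip_probs:
        # Flip each training label independently with probability p.
        flips = rng.random(n_train) < p
        y_noisy = np.where(flips, 1 - y_train_clean, y_train_clean)
        pred = one_nn_predict(x_train, y_noisy, x_test)
        # Test labels stay clean, as in the noise profiles described above.
        errors.append(float(np.mean(pred != y_test)))
    return errors


if __name__ == "__main__":
    probs = [0.0, 0.1, 0.2, 0.3, 0.4]
    for p, err in zip(probs, noise_profile(probs)):
        print(f"flip probability {p:.1f}: clean test error {err:.3f}")
```

Repeating the computation for several values of `n_train` gives one curve per training size, which is the form in which the figures present their noise profiles.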

Theorems & Definitions (1)

  • Theorem 3.1: KR trichotomy
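
The dependence of the KR regime on kernel and ridge can be probed numerically along the lines of the Figure 4 setup: inputs uniform on the unit sphere, pure-noise labels $y_i \sim \mathcal{N}(0,1)$, and test MSE against a clean (zero) target. The sketch below is an illustration under assumptions, not the authors' code: the kernel width of 1, the dimension `d=5`, the sample sizes, and the helper names `sample_sphere`, `kernel`, and `noise_test_mse` are all hypothetical choices. The ridgeless Gaussian case from panel (c) is left out because its kernel matrix is numerically ill-conditioned and would need more care than a plain linear solve.

```python
import numpy as np


def sample_sphere(n, d, rng):
    """Draw n points uniformly from the unit sphere S^{d-1}."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def kernel(a, b, kind, width=1.0):
    """Laplace or Gaussian kernel matrix between rows of a and rows of b."""
    sq = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    dist = np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives from rounding
    if kind == "laplace":
        return np.exp(-dist / width)
    if kind == "gaussian":
        return np.exp(-(dist**2) / (2.0 * width**2))
    raise ValueError(f"unknown kernel kind: {kind}")


def noise_test_mse(n, d, kind, ridge, rng, n_test=500):
    """Fit kernel (ridge) regression to pure-noise labels; return clean test MSE."""
    x_train = sample_sphere(n, d, rng)
    y_train = rng.standard_normal(n)  # labels are pure noise
    x_test = sample_sphere(n_test, d, rng)
    k_train = kernel(x_train, x_train, kind)
    coef = np.linalg.solve(k_train + ridge * np.eye(n), y_train)
    pred = kernel(x_test, x_train, kind) @ coef
    # The clean target is identically zero, so the MSE is the mean squared prediction.
    return float(np.mean(pred**2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for n in (32, 64, 128, 256):
        ridged = noise_test_mse(n, 5, "gaussian", 0.1, rng)
        ridgeless = noise_test_mse(n, 5, "laplace", 0.0, rng)
        print(f"n={n:4d}  Gaussian, ridge 0.1: {ridged:.4f}  Laplace, ridgeless: {ridgeless:.4f}")
```

Each printed value comes from a single random draw, so averaging over several seeds would be needed before reading a trend off the numbers.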