Mind the spikes: Benign overfitting of kernels and neural networks in fixed dimension

Moritz Haas, David Holzmüller, Ulrike von Luxburg, Ingo Steinwart

TL;DR

This paper generalizes existing inconsistency results to non-interpolating models and more kernels to show that benign overfitting with moderate derivatives is impossible in fixed dimension, and shows that rate-optimal benign overfitting is possible for regression with a sequence of spiky-smooth kernels with large derivatives.

Abstract

The success of over-parameterized neural networks trained to near-zero training error has caused great interest in the phenomenon of benign overfitting, where estimators are statistically consistent even though they interpolate noisy training data. While benign overfitting in fixed dimension has been established for some learning methods, current literature suggests that for regression with typical kernel methods and wide neural networks, benign overfitting requires a high-dimensional setting where the dimension grows with the sample size. In this paper, we show that the smoothness of the estimators, and not the dimension, is the key: benign overfitting is possible if and only if the estimator's derivatives are large enough. We generalize existing inconsistency results to non-interpolating models and more kernels to show that benign overfitting with moderate derivatives is impossible in fixed dimension. Conversely, we show that rate-optimal benign overfitting is possible for regression with a sequence of spiky-smooth kernels with large derivatives. Using neural tangent kernels, we translate our results to wide neural networks. We prove that while infinite-width networks do not overfit benignly with the ReLU activation, this can be fixed by adding small high-frequency fluctuations to the activation function. Our experiments verify that such neural networks, while overfitting, can indeed generalize well even on low-dimensional data sets.
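
The remedy the abstract proposes for ReLU networks, adding small high-frequency fluctuations to the activation function, can be illustrated with a minimal sketch. The function name `spiky_smooth_activation`, the parameters `amplitude` and `frequency`, and the plain unshifted sine are assumptions made for illustration; the paper's exact activation (including its shift and scaling) is given by an equation not reproduced in this summary.

```python
import numpy as np


def relu(x):
    """Standard ReLU activation."""
    return np.maximum(x, 0.0)


def spiky_smooth_activation(x, amplitude=0.01, frequency=100.0):
    """ReLU plus a small high-frequency sine perturbation.

    Hypothetical illustrative form: the perturbation is small in value
    (bounded by `amplitude`) but contributes a derivative of size up to
    `amplitude * frequency`.
    """
    return relu(x) + amplitude * np.sin(frequency * x)


x = np.linspace(-1.0, 1.0, 1001)
# Largest pointwise deviation from ReLU; bounded by `amplitude`.
max_deviation = np.max(np.abs(spiky_smooth_activation(x) - relu(x)))
# Largest possible contribution of the perturbation to the derivative.
max_extra_slope = 0.01 * 100.0
```

The point of the sketch is that the function values stay close to ReLU while the derivative can be made large by raising `frequency`, which matches the abstract's claim that large derivatives, rather than dimension, are what matter.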

Paper Structure (53 sections, 40 theorems, 112 equations, 19 figures)

Key Result

Theorem 1

Let assumptions (D1) and (K) hold. Let $c_{\mathrm{fit}} \in (0, 1]$ and $C_{\mathrm{norm}} > 0$. Then, there exist $c > 0$ and $n_0 \in \mathbb{N}$ such that the following holds for all $n \geq n_0$ with probability $1-O(1/n)$ over the draw of the data set $D$ with $n$ samples: Every function $f \i has an excess risk that satisfies

Figures (19)

  • Figure 1: Spiky-smooth overfitting in 2 dimensions. a. We plot the predicted function for ridgeless kernel regression with the Laplace kernel (blue) versus our spiky-smooth kernel \ref{eq:def_spsm_kernel} with Laplace components (orange) on $\mathbb{S}^1$. The dashed black line shows the true regression function; black 'x' markers denote noisy training points. Further details can be found in \ref{sec:exp_finite_n}. b. The predicted function of a trained 2-layer neural network with ReLU activation (blue) versus ReLU plus shifted high-frequency $\sin$-function \ref{eq:spsm_activ_ntk} (orange). Using the weights learned with the spiky-smooth activation function in a ReLU network (green) disentangles the spike component from the signal component. c. Training error (solid lines) and test error (dashed lines) over the course of training for b., evaluated on $10^4$ test points. The dotted black line shows the optimal test error. The spiky-smooth activation function does not require regularization and can simply be trained to overfit.
  • Figure 2: The spiky-smooth kernel with Laplace components (orange) consists of a Laplace kernel (blue) plus a Laplace kernel of height $\rho$ and small bandwidth $\gamma$.
  • Figure 3: a., b. Gaussian NTK activation components $\phi_{NTK}^{\check{k}_{\gamma}}$ defined via \ref{eq:activ_def} induced by the Gaussian kernel with varying bandwidth $\gamma\in [0.2,0.1,0.05]$ (the darker, the smaller $\gamma$) for a. bi-alternating signs $s_i=+1$ iff $\lfloor {i}/{2}\rfloor$ even, and b. i.i.d. random signs $s_i\sim \mathcal{U}(\{-1,+1\})$. c. Coefficients of the Hermite series of a Gaussian NTK activation component with varying bandwidth $\gamma$. Observe peaks at $2/\gamma$. For reliable approximations of activation functions, use a truncation $\geq 4/\gamma$. The sum of squares of the coefficients follows Eq. \ref{eq:l2norm_activationfct}. \ref{fig:3_nngp} visualizes NNGP activation components.
  • Figure I.1: a. The $ReLU$-component $f_{ReLU}$ (blue) and the full spiky-smooth network $f_{spsm}$ (orange) of the learned neural network from \ref{fig:nn_training}. b. The spike component $f_{spikes}$ of the learned neural network from \ref{fig:nn_training} against the label noise in the training set, derived by subtracting the signal from the training points. Observe that the $ReLU$-component has learned the signal, while the spike component has fitted the noise in the data and regresses to $0$ between data points.
  • Figure I.2: The functions learned by $12$ random hidden-layer neurons of the spike component network $f_{spikes}$ corresponding to \ref{fig:nn_training}.
  • ...and 14 more figures
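
The construction described in the Figure 2 caption (a Laplace kernel plus a second Laplace kernel of height $\rho$ and small bandwidth $\gamma$) can be sketched with ridgeless kernel regression on $\mathbb{S}^1$. This is a minimal sketch and not the authors' code: the function names, the toy regression function, the noise level, and the values of `smooth_bandwidth`, `rho`, and `gamma` are all assumptions chosen for illustration.

```python
import numpy as np


def laplace_kernel(X, Z, bandwidth):
    """Laplace kernel exp(-||x - z|| / bandwidth) between rows of X and Z."""
    dists = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=-1)
    return np.exp(-dists / bandwidth)


def spiky_smooth_kernel(X, Z, smooth_bandwidth=1.0, rho=0.1, gamma=0.01):
    """Smooth Laplace component plus a narrow spike of height rho (assumed values)."""
    return laplace_kernel(X, Z, smooth_bandwidth) + rho * laplace_kernel(X, Z, gamma)


def ridgeless_predict(kernel, X_train, y_train, X_test):
    """Kernel interpolation: solve K alpha = y with no ridge penalty."""
    alpha = np.linalg.solve(kernel(X_train, X_train), y_train)
    return kernel(X_test, X_train) @ alpha


# Toy data on the unit circle S^1 (hypothetical signal and noise level).
rng = np.random.default_rng(0)
angles = np.linspace(0.0, 2.0 * np.pi, 30, endpoint=False)
X_train = np.stack([np.cos(angles), np.sin(angles)], axis=1)
y_train = np.sin(angles) + 0.3 * rng.standard_normal(angles.shape[0])

# The ridgeless estimator reproduces the noisy labels at the training points.
train_pred = ridgeless_predict(spiky_smooth_kernel, X_train, y_train, X_train)
K_train = spiky_smooth_kernel(X_train, X_train)
```

With a small `gamma`, the spike component is close to `rho` times the identity on the training points, so it acts much like a ridge term there while still letting the estimator interpolate; away from the training points the prediction is dominated by the smooth component.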

Theorems & Definitions (82)

  • Theorem 1: Overfitting estimators with small norms are inconsistent
  • Proposition 2: Popular estimators fulfill the norm bound (N)
  • Remark 3: Dimension dependency
  • Theorem 4: Overfitting with neural network kernels in fixed dimension is inconsistent
  • Proposition 4: Spectral lower bound
  • Theorem 5: Inconsistency for Sobolev dot-product kernels on the sphere
  • Definition 6: Spiky-smooth kernel
  • Theorem 7: Consistency of spiky-smooth ridgeless kernel regression
  • Remark 8: Benign overfitting with optimal convergence rates
  • Remark 9: Interplay between smoothness and dimensionality
  • ...and 72 more