Beyond Benign Overfitting in Nadaraya-Watson Interpolators

Daniel Barzilai, Guy Kornowski, Ohad Shamir

TL;DR

This work analyzes the interpolating Nadaraya-Watson classifier with predictor $\hat{h}_\beta(\mathbf{x}) = \text{sign}\left(\sum_{i=1}^m \frac{y_i}{\|\mathbf{x}-\mathbf{x}_i\|^{\beta}}\right)$. By varying the bandwidth parameter $\beta$ relative to the ambient dimension $d$, the authors prove a tripartite generalization behavior: catastrophic overfitting for $\beta<d$, benign overfitting at $\beta=d$, and tempered overfitting for $\beta>d$, with clean error scales characterized in terms of $p$ and logarithmic factors. They further argue that the optimal $\beta$ aligns with the intrinsic dimension $d_{\text{int}}$ of the data, implying that over-estimating $d_{\text{int}}$ is often safer than under-estimating it; these insights are supported by experiments on synthetic data and real data like MNIST. Overall, the paper extends classic NW analysis by revealing rich, non-monotone generalization phenomena in a simple interpolating rule, and provides practical guidance for tuning kernel parameters in light of data geometry.
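
As a rough illustration of the predictor $\hat{h}_\beta$ described above, the following NumPy sketch evaluates the sign of the inverse-distance-weighted label sum. The function name `nw_interpolator_predict`, the handling of test points that coincide with a training point, and the rescaling by the nearest distance (used only to avoid overflow) are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def nw_interpolator_predict(X_train, y_train, X_test, beta):
    """Sign of sum_i y_i / ||x - x_i||^beta for each test point x.

    Hypothetical helper for illustration; labels are assumed to be +1/-1.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)

    # Pairwise Euclidean distances, shape (n_test, m).
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)

    preds = np.empty(len(X_test))
    for j, row in enumerate(dists):
        zero = row == 0.0
        if zero.any():
            # The weight diverges at a training point, so the rule
            # interpolates: return that point's label (assumed convention).
            preds[j] = y_train[np.argmax(zero)]
        else:
            # Dividing by the smallest distance multiplies the sum by a
            # positive constant, so the sign is unchanged while all
            # weights stay in (0, 1].
            scaled = row / row.min()
            preds[j] = np.sign(np.sum(y_train * scaled ** (-beta)))
    return preds
```

For instance, with training points `[[0.], [1.], [3.]]` and labels `[1, -1, 1]`, a test point at `0.9` with `beta=1` is dominated by the nearby `-1` label. An exact tie in the weighted sum would yield `0`, which a caller would need to resolve.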

Abstract

In recent years, there has been much interest in understanding the generalization behavior of interpolating predictors, which overfit on noisy training data. Whereas standard analyses are concerned with whether a method is consistent or not, recent observations have shown that even inconsistent predictors can generalize well. In this work, we revisit the classic interpolating Nadaraya-Watson (NW) estimator (also known as Shepard's method), and study its generalization capabilities through this modern viewpoint. In particular, by varying a single bandwidth-like hyperparameter, we prove the existence of multiple overfitting behaviors, ranging non-monotonically from catastrophic, through benign, to tempered. Our results highlight how even classical interpolating methods can exhibit intricate generalization behaviors. In addition, for the purpose of tuning the hyperparameter, the results suggest that over-estimating the intrinsic dimension of the data is less harmful than under-estimating it. Numerical experiments complement our theory, demonstrating the same phenomena.

Paper Structure

This paper contains 24 sections, 15 theorems, 63 equations, 6 figures.

Key Result

Theorem 1.1

Suppose $\mathcal{D}_\mathbf{x}$ has a density on $\mathbb{R}^d$, and let $\beta=d$. For any noise level $p\in(0,0.49)$, it holds that the clean classification error of $\hat{h}_\beta$ goes to zero as $m\to\infty$, i.e. $\hat{h}_\beta$ exhibits benign overfitting.
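
To make the notion of clean classification error under label noise concrete, here is a small simulation sketch in dimension $d=1$. Everything specific in it is an assumption for illustration: inputs uniform on $[0,1]$, a threshold target at $0.5$, labels flipped independently with probability $p$, and the helper names `nw_sign` and `clean_error`. It does not reproduce the paper's experimental setup, and a finite-sample run is not a substitute for the asymptotic statement of the theorem.

```python
import numpy as np

def nw_sign(X_train, y_train, X_test, beta):
    """1-D interpolating NW rule: sign of sum_i y_i / |x - x_i|^beta."""
    dists = np.abs(X_test[:, None] - X_train[None, :])
    nearest = dists.argmin(axis=1)
    dmin = dists[np.arange(len(X_test)), nearest]

    preds = np.empty(len(X_test))
    exact = dmin == 0.0
    # At a training point the rule returns that point's label.
    preds[exact] = y_train[nearest[exact]]
    # Rescale by the nearest distance so all weights lie in (0, 1];
    # this positive rescaling leaves the sign unchanged.
    scaled = dists[~exact] / dmin[~exact, None]
    preds[~exact] = np.sign((y_train[None, :] * scaled ** (-beta)).sum(axis=1))
    return preds

def clean_error(m, p, beta, n_test=2000, seed=0):
    """Estimate the error against the noiseless target (assumed setup)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=m)
    clean = np.where(X >= 0.5, 1.0, -1.0)      # assumed target function
    flips = rng.random(m) < p                  # label noise with rate p
    y = np.where(flips, -clean, clean)

    X_test = rng.uniform(0.0, 1.0, size=n_test)
    target = np.where(X_test >= 0.5, 1.0, -1.0)
    return float(np.mean(nw_sign(X, y, X_test, beta) != target))

if __name__ == "__main__":
    # Compare beta below, at, and above d = 1 for one noise level.
    for beta in (0.5, 1.0, 2.0):
        print(beta, clean_error(m=2000, p=0.1, beta=beta))
```

The error is measured against the noiseless target rather than the noisy labels, which is what distinguishes the clean error from the training error (the latter is zero for an interpolating rule).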

Figures (6)

  • Figure 2: Illustration of the lower bound construction used in the proof of Theorem \ref{thm:catastrophic}. When $\beta<d$, the inner circle will be misclassified as $+1$ with high probability, inducing constant error.
  • Figure 3: The classification error of $\hat{h}_\beta$ for varying values of $\beta$, with data in dimension $d=1$ given by Eq. (\ref{eq: data 1d}). On the left, $m=2000$ is fixed, $p$ varies. On the right, $p=0.04$ is fixed, $m$ varies. Best viewed in color.
  • Figure 4: The classification error of $\hat{h}_\beta$ for varying values of $\beta$, with data on $\mathbb{S}^2\subset\mathbb{R}^3$ given by Eq. (\ref{eq: data 2d}). On the left, $m=2000$ is fixed, $p$ varies. On the right, $p=0.04$ is fixed, $m$ varies. Best viewed in color.
  • Figure 5: The classification error of $\hat{h}_\beta$ for varying values of $\beta$, with respect to MNIST's $0/1$ data. On the left, $m$ is fixed to the entire train set, $p$ varies. On the right, $p=0.1$ is fixed, $m$ varies. Best viewed in color.
  • Figure 6: The classification error of $\hat{h}_\beta$ for varying values of $\beta$ and sampling noise $\sigma^2$. Best viewed in color.
  • ...and 1 more figure

Theorems & Definitions (26)

  • Theorem 1.1: devroye1998hilbert
  • Theorem 1.2: Main results, informal
  • Theorem 4.1
  • proof : Proof sketch of Theorem \ref{thm: tempered}
  • Theorem 5.1
  • Remark 5.2
  • proof : Proof sketch of Theorem \ref{thm:catastrophic}
  • Lemma A.1: devroye2006nonuniform, Theorem 2.1
  • Lemma A.2
  • Lemma A.3: shorack2009empirical, Chapter 8, Proposition 1
  • ...and 16 more