Beyond Benign Overfitting in Nadaraya-Watson Interpolators
Daniel Barzilai, Guy Kornowski, Ohad Shamir
TL;DR
This work analyzes the interpolating Nadaraya-Watson classifier with predictor $\hat{h}_\beta(\mathbf{x}) = \text{sign}\left(\sum_{i=1}^m \frac{y_i}{\|\mathbf{x}-\mathbf{x}_i\|^{\beta}}\right)$. By varying the bandwidth-like hyperparameter $\beta$ relative to the ambient dimension $d$, the authors prove three distinct generalization regimes: catastrophic overfitting for $\beta<d$, benign overfitting at $\beta=d$, and tempered overfitting for $\beta>d$, with the scaling of the clean error characterized in terms of $p$ and logarithmic factors. They further argue that the optimal $\beta$ aligns with the intrinsic dimension $d_{\text{int}}$ of the data, implying that over-estimating $d_{\text{int}}$ is often safer than under-estimating it; these insights are supported by experiments on synthetic data and on real data such as MNIST. Overall, the paper extends the classic NW analysis by revealing rich, non-monotone generalization phenomena in a simple interpolating rule, and it provides practical guidance for tuning kernel parameters in light of data geometry.
Abstract
In recent years, there has been much interest in understanding the generalization behavior of interpolating predictors, which overfit on noisy training data. Whereas standard analyses are concerned with whether a method is consistent or not, recent observations have shown that even inconsistent predictors can generalize well. In this work, we revisit the classic interpolating Nadaraya-Watson (NW) estimator (also known as Shepard's method), and study its generalization capabilities through this modern viewpoint. In particular, by varying a single bandwidth-like hyperparameter, we prove the existence of multiple overfitting behaviors, ranging non-monotonically from catastrophic, through benign, to tempered. Our results highlight how even classical interpolating methods can exhibit intricate generalization behaviors. In addition, for the purpose of tuning the hyperparameter, the results suggest that over-estimating the intrinsic dimension of the data is less harmful than under-estimating it. Numerical experiments complement our theory, demonstrating the same phenomena.
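
To make the predictor concrete, the following is a minimal NumPy sketch of an interpolating NW classifier of the form $\text{sign}\left(\sum_i y_i / \|\mathbf{x}-\mathbf{x}_i\|^{\beta}\right)$ with labels in $\{-1,+1\}$. The function name `nw_interpolating_classifier`, the handling of queries that coincide with a training point, and the toy data are assumptions made for illustration; this is not the authors' code.

```python
import numpy as np


def nw_interpolating_classifier(X_train, y_train, X_query, beta):
    """Predict sign(sum_i y_i / ||x - x_i||^beta) for each query point x.

    Illustrative sketch: labels are assumed to be in {-1, +1}, and a query
    that coincides with a training point returns that point's label
    (the singular weight dominates the sum, so the rule interpolates).
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_query = np.atleast_2d(np.asarray(X_query, dtype=float))

    # Pairwise Euclidean distances, shape (n_query, m).
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=-1)

    preds = np.empty(len(X_query))
    for q, d in enumerate(dists):
        hit = d == 0
        if hit.any():
            # Query equals a training point: use its label directly.
            # (Duplicate points with conflicting labels may give 0.)
            preds[q] = np.sign(y_train[hit].sum())
        else:
            # Inverse-distance weights with exponent beta.
            preds[q] = np.sign(np.sum(y_train / d**beta))
    return preds


# Toy usage: two training points in the plane with opposite labels.
X_train = [[0.0, 0.0], [1.0, 0.0]]
y_train = [1, -1]
train_preds = nw_interpolating_classifier(X_train, y_train, X_train, beta=2.0)
near_preds = nw_interpolating_classifier(
    X_train, y_train, [[0.1, 0.0], [0.9, 0.0]], beta=2.0
)
```

In this sketch `train_preds` reproduces the training labels, which is the interpolation property, and each entry of `near_preds` follows the nearer training point. Larger values of `beta` concentrate the weight on the closest neighbors, while smaller values spread it over more distant points.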
