Benign Overfitting in Linear Regression
Peter L. Bartlett, Philip M. Long, Gábor Lugosi, Alexander Tsigler
TL;DR
This work characterizes when a perfect fit to training data in linear regression remains predictive, by relating the excess risk of the minimum-norm interpolator to two notions of the effective rank of the data covariance, r_k(Σ) and R_k(Σ). The results show that substantial overparameterization (many small-variance directions) is essential for benign overfitting, with finite-sample bounds that depend on the tail of the covariance spectrum. The analysis highlights a difference between infinite-dimensional settings and finite-dimensional settings whose dimension grows faster than the sample size, and it draws connections to neural networks via the neural tangent kernel (NTK) framework, suggesting that finite-dimensional approximations may be key to understanding benign overfitting in practice. The findings give a rigorous account of when interpolating predictions can achieve near-optimal accuracy and point to future directions, including extensions beyond linear models and relaxed distributional assumptions.
Abstract
The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. By studying examples of data covariance properties that this characterization shows are required for benign overfitting, we find an important role for finite-dimensional data: the accuracy of the minimum norm interpolating prediction rule approaches the best possible accuracy for a much narrower range of properties of the data distribution when the data lies in an infinite dimensional space versus when the data lies in a finite dimensional space whose dimension grows faster than the sample size.
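To make the two central objects concrete, the following is a minimal sketch rather than the authors' code. It assumes the effective-rank definitions from the full paper, r_k(Σ) = (Σ_{i>k} λ_i) / λ_{k+1} and R_k(Σ) = (Σ_{i>k} λ_i)² / Σ_{i>k} λ_i², where λ_1 ≥ λ_2 ≥ ... are the eigenvalues of Σ, and it computes the minimum norm interpolator with a pseudoinverse. The spectrum, sample size, noise level, and the helper names `effective_ranks` and `min_norm_interpolator` are hypothetical choices made only for illustration.

```python
import numpy as np


def effective_ranks(eigvals, k):
    """Return (r_k, R_k) for a covariance spectrum.

    Assumed definitions (lambda_1 >= lambda_2 >= ...):
      r_k = sum_{i>k} lambda_i / lambda_{k+1}
      R_k = (sum_{i>k} lambda_i)^2 / sum_{i>k} lambda_i^2
    """
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    tail = lam[k:]  # lambda_{k+1}, lambda_{k+2}, ... in 1-based notation
    r_k = tail.sum() / tail[0]
    R_k = tail.sum() ** 2 / np.sum(tail ** 2)
    return r_k, R_k


def min_norm_interpolator(X, y):
    """Minimum Euclidean norm solution of X @ theta = y.

    For X with full row rank (more parameters than samples),
    pinv(X) @ y equals X.T @ inv(X @ X.T) @ y.
    """
    return np.linalg.pinv(X) @ y


# Hypothetical setup: one important direction plus many small-variance ones.
rng = np.random.default_rng(0)
n, p = 20, 400
eigvals = np.concatenate(([1.0], np.full(p - 1, 1e-2)))

# Gaussian rows with diagonal covariance diag(eigvals).
X = rng.standard_normal((n, p)) * np.sqrt(eigvals)
theta_star = np.zeros(p)
theta_star[0] = 1.0
y = X @ theta_star + 0.1 * rng.standard_normal(n)  # noisy labels

theta_hat = min_norm_interpolator(X, y)
r0, R0 = effective_ranks(eigvals, 0)
r1, R1 = effective_ranks(eigvals, 1)

print("fits training data exactly:", np.allclose(X @ theta_hat, y))
print("r_0, R_0:", r0, R0)
print("r_1, R_1:", r1, R1)
```

In this toy spectrum the tail beyond the first direction is flat, so r_1 and R_1 both equal the number of tail directions (399), which is much larger than the sample size of 20. This is the kind of regime the abstract describes, in which the directions unimportant for prediction greatly outnumber the samples.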
