Algebraic and Statistical Properties of the Ordinary Least Squares Interpolator
Dennis Shen, Dogyoon Song, Peng Ding, Jasjeet S. Sekhon
TL;DR
This work analyzes the ordinary least squares (OLS) interpolator in the overparameterized regime, establishing high-dimensional algebraic analogs of classical results such as leave-$k$-out formulas, Cochran's formula, and the Frisch–Waugh–Lovell theorem for the minimum $\ell_2$-norm OLS interpolator. It shows that the OLS interpolator remains optimal among linear unbiased estimators in a high-dimensional sense (an extension of the Gauss–Markov theorem) and proposes a novel variance estimator under homoskedasticity that is unbiased in the classical regime and conservatively biased in high dimensions. The paper also develops row- and column-partitioned regression results to analyze generalization, omitted-variable bias in observational studies, and covariate adjustment in randomized experiments, linking algebraic decompositions to causal inference. Complemented by simulations under multiple covariate models, the results illuminate when benign overfitting occurs and how to perform reliable inference and treatment-effect estimation in high-dimensional settings.
Abstract
Deep learning research has uncovered the phenomenon of benign overfitting for overparameterized statistical models, which has drawn significant theoretical interest in recent years. Given its simplicity and practicality, the ordinary least squares (OLS) interpolator has become essential for gaining foundational insights into this phenomenon. While the properties of OLS are well established in classical, underparameterized settings, its behavior in high-dimensional, overparameterized regimes is less explored (unlike that of ridge or lasso regression), though significant progress has been made of late. We contribute to this growing literature by providing fundamental algebraic and statistical results for the minimum $\ell_2$-norm OLS interpolator. In particular, we provide algebraic equivalents of (i) the leave-$k$-out residual formula, (ii) Cochran's formula, and (iii) the Frisch–Waugh–Lovell theorem in the overparameterized regime. These results aid in understanding the OLS interpolator's ability to generalize and have substantive implications for causal inference. Under the Gauss–Markov model, we present statistical results such as an extension of the Gauss–Markov theorem and an analysis of variance estimation under homoskedastic errors for the overparameterized regime. To substantiate our theoretical contributions, we conduct simulations that further explore the stochastic properties of the OLS interpolator.
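
For readers new to the object of study, the following is a minimal sketch of the minimum $\ell_2$-norm OLS interpolator on a toy design with $n = 2$ observations and $p = 3$ covariates. It is an illustration, not the paper's code: the function name `min_norm_interpolator`, the toy data, and the restriction to $n = 2$ are assumptions made here. It relies on the standard identity $\hat{\beta} = X^\top (X X^\top)^{-1} y$, which holds when $X$ has full row rank, and uses exact rational arithmetic so that interpolation can be checked without rounding error.

```python
from fractions import Fraction


def min_norm_interpolator(X, y):
    """Minimum l2-norm solution of X b = y for a 2 x p matrix X of full row rank.

    Hypothetical helper for illustration only. It computes
    b = X^T (X X^T)^{-1} y using the closed-form inverse of a 2 x 2 matrix.
    """
    if len(X) != 2 or len(y) != 2:
        raise ValueError("this sketch handles exactly n = 2 observations")
    # Gram matrix G = X X^T, which is 2 x 2 and symmetric.
    g11 = sum(a * a for a in X[0])
    g12 = sum(a * b for a, b in zip(X[0], X[1]))
    g22 = sum(b * b for b in X[1])
    det = g11 * g22 - g12 * g12
    if det == 0:
        raise ValueError("rows of X must be linearly independent")
    # alpha = G^{-1} y.
    a1 = (g22 * y[0] - g12 * y[1]) / det
    a2 = (-g12 * y[0] + g11 * y[1]) / det
    # beta = X^T alpha, a vector of length p.
    return [a1 * x0 + a2 * x1 for x0, x1 in zip(X[0], X[1])]


# Toy overparameterized design (assumed data): n = 2, p = 3.
X = [[Fraction(1), Fraction(0), Fraction(1)],
     [Fraction(0), Fraction(1), Fraction(1)]]
y = [Fraction(1), Fraction(2)]

beta_hat = min_norm_interpolator(X, y)
# Fitted values X @ beta_hat; interpolation means these equal y exactly.
fitted = [sum(x * b for x, b in zip(row, beta_hat)) for row in X]

# Another exact solution of X b = y, used only to compare squared norms.
other_solution = [Fraction(1), Fraction(2), Fraction(0)]
norm_sq_hat = sum(b * b for b in beta_hat)
norm_sq_other = sum(b * b for b in other_solution)
```

In this example the fitted values reproduce `y` exactly, and `beta_hat` has a smaller squared norm than `other_solution`, which also fits the data perfectly. That is the sense in which the interpolator is the minimum-norm choice among the infinitely many exact fits available when $p > n$.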
