Malign Overfitting: Interpolation Can Provably Preclude Invariance
Yoav Wald, Gal Yona, Uri Shalit, Yair Carmon
TL;DR
The paper proves a fundamental barrier: in a simple linear two-environment Gaussian mixture with core and spurious features, any interpolating classifier, even one with arbitrarily small margin, cannot be invariant and therefore incurs high robust error. In contrast, a two-phase, non-interpolating estimator that separates training from post-processing and enforces an invariance constraint (e.g., Equal Opportunity) provably learns a nearly invariant classifier with low robust error. In simulations and on the Waterbirds dataset, the authors show that interpolation undermines invariance, while the two-phase approach improves invariance and fairness in high-dimensional regimes. These results expose a limit of invariance-inducing methods that interpolate the training data and motivate two-stage strategies with finite-sample guarantees for robust, fair generalization.
Abstract
Learned classifiers should often possess certain invariance properties meant to encourage fairness, robustness, or out-of-distribution generalization. However, multiple recent works empirically demonstrate that common invariance-inducing regularizers are ineffective in the over-parameterized regime, in which classifiers perfectly fit (i.e., interpolate) the training data. This suggests that the phenomenon of "benign overfitting", in which models generalize well despite interpolating, might not favorably extend to settings in which robustness or fairness are desirable. In this work we provide a theoretical justification for these observations. We prove that -- even in the simplest of settings -- any interpolating learning rule (with arbitrarily small margin) will not satisfy these invariance properties. We then propose and analyze an algorithm that -- in the same setting -- successfully learns a non-interpolating classifier that is provably invariant. We validate our theoretical observations on simulated data and the Waterbirds dataset.
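To make the setting concrete, the following is a minimal illustrative sketch, not the authors' code or exact model. It assumes a two-environment Gaussian mixture in which the spurious feature mean is scaled by a per-environment parameter `theta`, fits a minimum-norm interpolator in the over-parameterized regime, and measures an Equal Opportunity gap (the difference in true positive rates between environments). The function names, dimensions, means, and `theta` values are all hypothetical choices made for illustration.

```python
import numpy as np

def sample_env(n, theta, mu_c, mu_s, rng, sigma=1.0):
    """Draw n points from one environment (assumed model, for illustration).

    Labels y are uniform on {-1, +1}. Core features have mean y * mu_c.
    Spurious features have mean theta * y * mu_s, so theta controls how
    strongly (and in which direction) they correlate with the label.
    """
    y = rng.choice([-1, 1], size=n)
    core = y[:, None] * mu_c + sigma * rng.standard_normal((n, mu_c.size))
    spur = theta * y[:, None] * mu_s + sigma * rng.standard_normal((n, mu_s.size))
    return np.hstack([core, spur]), y

def min_norm_interpolator(X, y):
    """Minimum-norm solution of X w = y; interpolates when X has full row rank."""
    return np.linalg.pinv(X) @ y

def true_positive_rate(w, X, y):
    """Fraction of positive examples that the linear rule sign(x @ w) gets right."""
    pos = y == 1
    return float(np.mean(X[pos] @ w > 0))

def equal_opportunity_gap(w, env_a, env_b):
    """Absolute difference in true positive rate between two environments."""
    return abs(true_positive_rate(w, *env_a) - true_positive_rate(w, *env_b))

rng = np.random.default_rng(0)
d_c, d_s, n = 200, 200, 50            # 400 features, 100 training points in total
mu_c = np.full(d_c, 0.1)              # hypothetical core mean
mu_s = np.full(d_s, 0.1)              # hypothetical spurious mean

# Two training environments that differ only in the spurious correlation.
train_a = sample_env(n, 1.0, mu_c, mu_s, rng)
train_b = sample_env(n, 0.5, mu_c, mu_s, rng)
X = np.vstack([train_a[0], train_b[0]])
y = np.concatenate([train_a[1], train_b[1]])

w = min_norm_interpolator(X, y)
margins = y * (X @ w)                 # every training margin equals 1 under interpolation

# Fresh samples from the same two environments, used to estimate the gap.
test_a = sample_env(2000, 1.0, mu_c, mu_s, rng)
test_b = sample_env(2000, 0.5, mu_c, mu_s, rng)
gap = equal_opportunity_gap(w, test_a, test_b)

print("min training margin:", margins.min())
print("spurious weight norm:", np.linalg.norm(w[d_c:]))
print("Equal Opportunity gap:", gap)
```

The printed values depend on the seed and on the assumed parameters, so this sketch only shows how interpolation and the invariance gap can be measured; it does not reproduce the paper's simulations or its two-phase algorithm.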
