Minimum-Norm Interpolation Under Covariate Shift

Neil Mallinar; Austin Zane; Spencer Frei; Bin Yu

TL;DR

This work develops the first finite-sample, instance-wise excess-risk bounds for the minimum-norm interpolator (MNI) under covariate shift in high-dimensional linear regression. It focuses on source distributions that satisfy benign overfitting and assumes that the source and target covariances commute. The analysis decomposes the risk into bias and variance components and introduces a taxonomy of covariate shifts, beneficial and malignant, driven by eigenvalue ratios and by the degree of overparameterization (mild or severe). The theoretical results are complemented by synthetic and real-data experiments (e.g., CIFAR-10/10C and neural networks) that validate the shift taxonomy and show that overparameterization can improve out-of-distribution robustness under certain shifts. The findings clarify when interpolation remains robust under distribution shift and open directions for extending the theory beyond simultaneous diagonalizability and to nonlinear models, with practical implications for transfer learning in noisy, high-dimensional settings.

Abstract

Transfer learning is a critical part of real-world machine learning deployments and has been extensively studied in experimental works with overparameterized neural networks. However, even in the simplest setting of linear regression a notable gap still exists in the theoretical understanding of transfer learning. In-distribution research on high-dimensional linear regression has led to the identification of a phenomenon known as benign overfitting, in which linear interpolators overfit to noisy training labels and yet still generalize well. This behavior occurs under specific conditions on the source covariance matrix and input data dimension. Therefore, it is natural to wonder how such high-dimensional linear models behave under transfer learning. We prove the first non-asymptotic excess risk bounds for benignly-overfit linear interpolators in the transfer learning setting. From our analysis, we propose a taxonomy of beneficial and malignant covariate shifts based on the degree of overparameterization. We follow our analysis with empirical studies that show these beneficial and malignant covariate shifts for linear interpolators on real image data, and for fully-connected neural networks in settings where the input data dimension is larger than the training sample size.

Paper Structure (49 sections, 20 theorems, 150 equations, 12 figures)

Introduction
Summary of contributions.
Prior Work and Comparisons to this Work
Preliminaries
Linear Models for Source and Target Data
Min-Norm Interpolator and Target Excess Risk
Separation of Components and Effective Ranks
Spiked Covariance Models
Main Theorems
A Taxonomy of Shifts
Overparameterization improves OOD robustness
Experiments
Synthetic Data Experiments
CIFAR-10 Experiments
Conclusion and Future Work
...and 34 more sections

Key Result

Theorem 2.2

(Target excess risk decomposition) The excess risk of the MNI trained on the source data, when evaluated on the target distribution, satisfies [display equation missing] and [display equation missing], where we define [definitions missing] and $\|\boldsymbol{x}\|_M^2 := \boldsymbol{x}^\top M \boldsymbol{x}$.
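As a hedged illustration of what a decomposition of this kind looks like, the sketch below computes the standard bias and variance terms of the MNI's target excess risk, averaged over label noise, and compares their sum against a Monte Carlo estimate. It is not the theorem's exact statement. It assumes the usual setup $\boldsymbol{y} = X\theta^* + \boldsymbol{\varepsilon}$ with independent, mean-zero noise of variance $\sigma^2$, and the function names `mni_target_risk_terms` and `monte_carlo_risk` are hypothetical.

```python
import numpy as np

def mni_target_risk_terms(X, theta_star, Sigma_t, sigma2):
    """Bias and variance terms of the noise-averaged target excess risk.

    Assumes X (n x p, n < p) has full row rank, so the MNI is
    theta_hat = A y with A = X^T (X X^T)^{-1}.
    """
    p = X.shape[1]
    A = X.T @ np.linalg.inv(X @ X.T)      # p x n map from labels to theta_hat
    P = A @ X                             # projection onto the row space of X
    r = (np.eye(p) - P) @ theta_star      # component of theta_star the MNI misses
    bias = float(r @ Sigma_t @ r)         # ||(I - P) theta_star||^2 in the Sigma_t norm
    variance = float(sigma2 * np.trace(A.T @ Sigma_t @ A))
    return bias, variance

def monte_carlo_risk(X, theta_star, Sigma_t, sigma2, trials, rng):
    """Average of ||theta_hat - theta_star||^2_{Sigma_t} over fresh Gaussian noise."""
    n = X.shape[0]
    A = X.T @ np.linalg.inv(X @ X.T)
    noise = np.sqrt(sigma2) * rng.standard_normal((trials, n))
    Y = X @ theta_star + noise            # one noisy label vector per row
    D = Y @ A.T - theta_star              # estimation error per trial
    risks = np.einsum("ti,ij,tj->t", D, Sigma_t, D)
    return float(risks.mean())
```

With `X` held fixed, the Monte Carlo average should approach `bias + variance` as `trials` grows, because the cross term between the bias direction and the noise has mean zero.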

Figures (12)

Figure 1: We experiment with the $(k, \delta, \epsilon)$ spiked covariance models and examine conditions for beneficial and malignant shifts as given in Theorem \ref{thm:benficial_malignant_shifts}. We take $n=60$, $k=10$, $\delta=1.0$, $\epsilon=10^{-6}$, $\tilde{\delta}=2.0$, $\tilde{\epsilon}=10^{-7}$, and vary $p$. To the right of $p=n$ we see a cross-over from mild to severe overparameterization, where both OOD shifts swap between beneficial and malignant. For both ID and OOD curves, we observe that excess risk is a decreasing function of input dimension. Curves are averaged over 100 independent runs.
Figure 2: We train 3-layer fully-connected ReLU networks with hidden width $h$ on $n$ samples from $p$-dimensional Gaussians. ID test data is sampled from the same distribution, and OOD test sets are constructed based on the beneficial and malignant covariate shifts in our theory. Ground truth models are sampled as ${\theta_{\mathsf{s}}^*} \sim \mathcal{S}^{p-1}$; no model shift is applied. For training data $X$, train labels are given by $\boldsymbol{y}_{\mathsf{s}} = X {\theta_{\mathsf{s}}^*} + \boldsymbol{\varepsilon}_{\mathsf{s}}$ with label noise $\boldsymbol{\varepsilon}_{\mathsf{s}} \sim \mathcal{N}(0, \sigma^2)$. All runs reach train loss $< 5 \times 10^{-6}$. Points are averaged over 20 independent runs with standard error bars reported.
Figure 3: We experiment with a custom variant of CIFAR-10C in which we apply the blur and noise image filters directly to the test set images of CIFAR-10 at each severity level; e.g., Severity 1 means that we add a small amount of noise and a small amount of blurring to the image. In the top row we first use the noise filter and then the blur filter. In the bottom row we first use the blur filter and then the noise filter. In (a) and (d), we observe that the eigenvalue decay of the shifts is non-monotonic and mirrors the $\alpha < 1, \beta > 1$ setting in our taxonomy. Indeed, we also see in (b) and (e) that when we are severely overparameterized the noisy tail effects appear to be suppressed and we still obtain beneficial shifts. On the other hand, in (c) and (f) we are in the mildly overparameterized regime and observe that the noisy tail effects hurt generalization, even for severity 4 in the top row, which only adds a small amount of noise in the tail. These results are exactly in keeping with our taxonomy for the $\alpha < 1, \beta > 1$ case. All curves are averaged over 50 independent runs.
Figure 4: We fit interpolating linear models to random Gaussian data sampled from spiked covariance models with parameters $k, \delta, \epsilon$. In this setting, $k=70$, $n=500$, $p=4900$, $\delta=1$, and $\epsilon=0.005$. To illustrate a beneficial shift, we scale the first $k$ eigenvalues by $\alpha=1.125$ and the last $p-k$ eigenvalues by $\beta=0.65$. Similarly, for the malignant shift we use $\alpha=0.875$ and $\beta=1.35$. All experiments are averaged over 25 independent runs with standard error bars displayed. Note that the bias is consistently below $10^{-16}$.
Figure 5: We experiment with the $(k, \delta, \epsilon)$ spiked covariance models and examine conditions for beneficial and malignant shifts as given in Theorem \ref{thm:benficial_malignant_shifts}. We take $n=50$, $k=10$, $\delta=1.0$, $\epsilon=10^{-6}$, $\tilde{\delta}=1.5$, $\tilde{\epsilon}=5 \times 10^{-7}$, and vary $p$. In all cases, $\mathrm{tr}(\Sigma_{\mathsf{t}}) > \mathrm{tr}(\Sigma_{\mathsf{s}})$, showing that beneficial shifts of this form can occur. As we increase $p$ while keeping the other problem parameters fixed, we observe the transition from mild to severe overparameterization and see the cross-over point where the shift goes from beneficial to malignant. For both ID and OOD excess risk, we observe that excess risk is a decreasing function of input dimension. Curves are averaged over 100 independent runs.
...and 7 more figures
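The multiplicative shifts described in the Figure 4 caption can be sketched in a few lines. This is a minimal sketch, not the authors' experiment code. It assumes that a $(k, \delta, \epsilon)$ spiked covariance is diagonal with $k$ eigenvalues equal to $\delta$ and $p-k$ equal to $\epsilon$, which is a reading of the captions rather than the paper's stated definition. It evaluates only the noise-driven variance term $\sigma^2\,\mathrm{tr}(A^\top \Sigma_{\mathsf{t}} A)$ with $A = X^\top (X X^\top)^{-1}$, and all helper names are hypothetical.

```python
import numpy as np

def spiked_eigs(p, k, delta, eps):
    """Eigenvalues of an assumed diagonal (k, delta, eps) spiked covariance."""
    return np.concatenate([np.full(k, delta, dtype=float),
                           np.full(p - k, eps, dtype=float)])

def multiplicative_shift(eigs_s, k, alpha, beta):
    """Scale the first k source eigenvalues by alpha and the rest by beta."""
    eigs_t = eigs_s.copy()
    eigs_t[:k] *= alpha
    eigs_t[k:] *= beta
    return eigs_t

def sample_gaussian(n, eigs, rng):
    """Draw n rows from N(0, diag(eigs))."""
    return rng.standard_normal((n, eigs.size)) * np.sqrt(eigs)

def variance_term(X, eigs_t, sigma2=1.0):
    """sigma^2 * tr(A^T diag(eigs_t) A) for the MNI map A = X^T (X X^T)^{-1}."""
    A = X.T @ np.linalg.inv(X @ X.T)
    return float(sigma2 * np.sum(eigs_t[:, None] * A**2))

def compare_shifts(n, p, k, delta, eps, shifts, seed=0):
    """Variance term on the source (ID) and on each (alpha, beta) shifted target."""
    rng = np.random.default_rng(seed)
    eigs_s = spiked_eigs(p, k, delta, eps)
    X = sample_gaussian(n, eigs_s, rng)
    out = {"id": variance_term(X, eigs_s)}
    for alpha, beta in shifts:
        eigs_t = multiplicative_shift(eigs_s, k, alpha, beta)
        out[(alpha, beta)] = variance_term(X, eigs_t)
    return out
```

For example, `compare_shifts(500, 4900, 70, 1.0, 0.005, [(1.125, 0.65), (0.875, 1.35)])` uses the Figure 4 parameters for a single draw of `X`. The caption reports averages over 25 runs and also tracks the bias, so a single-draw variance term is only a rough proxy for the plotted curves.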

Theorems & Definitions (39)

Definition 1: Linear regression
Theorem 2.2
Definition 2: $(k, \delta, \epsilon)$-spike model
Theorem 3.1
Theorem 3.2
Theorem 3.3
Definition 3: Beneficial and Malignant shifts
Definition 4: Mild and severe overparameterization for multiplicative shifts
Theorem 3.4
Definition 5: Linear regression under distribution shift
...and 29 more
