Generalization error of min-norm interpolators in transfer learning

Yanke Song; Sohom Bhattacharya; Pragya Sur

TL;DR

This paper establishes the generalization error of pooled min-$\ell_2$-norm interpolation in transfer learning, where data from diverse distributions are available, and develops a novel anisotropic local law to obtain these characterizations.

Abstract

This paper establishes the generalization error of pooled min-$\ell_2$-norm interpolation in transfer learning where data from diverse distributions are available. Min-norm interpolators emerge naturally as implicit regularized limits of modern machine learning algorithms. Previous work characterized their out-of-distribution risk when samples from the test distribution are unavailable during training. However, in many applications, a limited amount of test data may be available during training, yet properties of min-norm interpolation in this setting are not well understood. We address this gap by characterizing the bias and variance of pooled min-$\ell_2$-norm interpolation under covariate and model shifts. The pooled interpolator captures both early fusion and a form of intermediate fusion. Our results have several implications: under model shift, for low signal-to-noise ratio (SNR), adding data always hurts. For higher SNR, transfer learning helps as long as the shift-to-signal ratio (SSR) lies below a threshold that we characterize explicitly. By consistently estimating these ratios, we provide a data-driven method to determine: (i) when the pooled interpolator outperforms the target-based interpolator, and (ii) the optimal number of target samples that minimizes the generalization error. Under covariate shift, if the source sample size is small relative to the dimension, heterogeneity between domains improves the risk, and vice versa. We establish a novel anisotropic local law to achieve these characterizations, which may be of independent interest in random matrix theory. We supplement our theoretical characterizations with comprehensive simulations that demonstrate the finite-sample efficacy of our results.
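To make the object of study concrete, here is a minimal Python sketch of a pooled min-$\ell_2$-norm interpolator next to the target-only one. It is an illustration under assumptions, not the authors' code: the function names (`min_norm_interpolator`, `pooled_interpolator`, `target_risk`) are hypothetical, the pooled estimator is taken to be the pseudoinverse solution on the row-stacked source and target data, and the risk is measured as the quadratic form of the estimation error in the target covariance.

```python
import numpy as np


def min_norm_interpolator(X, y):
    """Minimum-l2-norm least-squares solution of X @ b = y.

    When X has more columns than rows and full row rank, this solution
    fits the training data exactly (it interpolates).
    """
    return np.linalg.pinv(X) @ y


def pooled_interpolator(X_src, y_src, X_tgt, y_tgt):
    """Min-norm interpolator fitted on source and target samples stacked together."""
    X = np.vstack([X_src, X_tgt])
    y = np.concatenate([y_src, y_tgt])
    return min_norm_interpolator(X, y)


def target_risk(beta_hat, beta_tgt, Sigma_tgt):
    """Estimation error weighted by the target covariance (assumed risk definition)."""
    d = beta_hat - beta_tgt
    return float(d @ Sigma_tgt @ d)
```

A typical comparison would call `pooled_interpolator(X_src, y_src, X_tgt, y_tgt)` and `min_norm_interpolator(X_tgt, y_tgt)` on the same data and compare their `target_risk` values; which one is smaller depends on the SNR and SSR, as the abstract describes.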

Paper Structure (36 sections, 33 theorems, 230 equations, 2 figures)


Introduction
Setup
Data Model
Estimator
Risk
Model Shift
Risk under isotropic design
Performance Comparison
Choice of interpolator
Extension to multiple source distributions
Numerical examples under model shift
Covariate Shift
Risk under simultaneously diagonalizable covariances
Performance Comparison Example: When does heterogeneity help?
Numerical examples under covariate shift
...and 21 more sections

Key Result

Lemma 2.1

Under the stated assumption (assumption1), the min-norm interpolator (eqn:interpolator) has a variance and a bias expressed in terms of $\hat{\bm{\Sigma}}$ and $\tilde{\bm{\beta}}$, where $\hat{\bm{\Sigma}} = \bm{X}^\top\bm{X}/n$ is the (uncentered) sample covariance matrix obtained by appending the source and target samples, and $\tilde{\bm{\beta}}:=\bm{\beta}^{(1)} - \bm{\beta}^{(2)}$ is the signal shift vector.
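As a hedged sketch of how a variance and bias decomposition of this kind arises, the following derivation works under assumptions that are not spelled out in the statement above: linear models $\bm{y}^{(k)} = \bm{X}^{(k)}\bm{\beta}^{(k)} + \bm{\epsilon}^{(k)}$ with mean-zero noise of variance $\sigma^2$ independent of the designs, index 1 for the source and index 2 for the target, $\bm{X}$ the row-stack of $\bm{X}^{(1)}$ and $\bm{X}^{(2)}$ with $n$ rows in total, and risk measured in the target covariance $\bm{\Sigma}^{(2)}$ conditional on $\bm{X}$. The grouping of terms may differ from the lemma's own expressions.

```latex
\begin{align*}
\hat{\bm{\beta}} = \bm{X}^{\dagger}\bm{y}
  &= \hat{\bm{\Sigma}}^{\dagger}\hat{\bm{\Sigma}}\,\bm{\beta}^{(2)}
   + \hat{\bm{\Sigma}}^{\dagger}\,\tfrac{1}{n}\bm{X}^{(1)\top}\bm{X}^{(1)}\,\tilde{\bm{\beta}}
   + \bm{X}^{\dagger}\bm{\epsilon}, \\
\hat{\bm{\beta}} - \bm{\beta}^{(2)}
  &= \big(\hat{\bm{\Sigma}}^{\dagger}\hat{\bm{\Sigma}} - \bm{I}\big)\bm{\beta}^{(2)}
   + \hat{\bm{\Sigma}}^{\dagger}\,\tfrac{1}{n}\bm{X}^{(1)\top}\bm{X}^{(1)}\,\tilde{\bm{\beta}}
   + \bm{X}^{\dagger}\bm{\epsilon}, \\
B &= \Big\| \big(\bm{\Sigma}^{(2)}\big)^{1/2}
      \Big[ \big(\hat{\bm{\Sigma}}^{\dagger}\hat{\bm{\Sigma}} - \bm{I}\big)\bm{\beta}^{(2)}
      + \hat{\bm{\Sigma}}^{\dagger}\,\tfrac{1}{n}\bm{X}^{(1)\top}\bm{X}^{(1)}\,\tilde{\bm{\beta}} \Big] \Big\|_2^2, \\
V &= \sigma^2 \operatorname{Tr}\!\big( \bm{X}^{\dagger\top}\bm{\Sigma}^{(2)}\bm{X}^{\dagger} \big)
   = \frac{\sigma^2}{n}\operatorname{Tr}\!\big( \hat{\bm{\Sigma}}^{\dagger}\bm{\Sigma}^{(2)} \big).
\end{align*}
```

The first line uses $\bm{X}^{\dagger} = (\bm{X}^\top\bm{X})^{\dagger}\bm{X}^\top$ and $\bm{\beta}^{(1)} = \bm{\beta}^{(2)} + \tilde{\bm{\beta}}$; the last uses $\bm{X}^{\dagger}\bm{X}^{\dagger\top} = (\bm{X}^\top\bm{X})^{\dagger} = \hat{\bm{\Sigma}}^{\dagger}/n$. When $\tilde{\bm{\beta}} = \bm{0}$ the shift term vanishes and only the projection term in $B$ remains.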

Figures (2)

Figure 1: Generalization error of pooled min-$\ell_2$-norm interpolator under model shift. Solid lines: theoretically predicted values. $+$ marks: empirical values. Dotted horizontal line: Risk for min-$\ell_2$-norm interpolator using only the target data. Design choices: $\beta_i^{(2)}\sim \mathcal{N}(0,\sigma_\beta^2/p)$ where $\sigma_\beta^2=\textnormal{SNR}$, $\beta_i^{(1)}=\beta_i^{(2)} + \mathcal{N}(0,\sigma_s^2/p)$ where $\sigma_s^2 = \sigma_\beta^2 \cdot \textnormal{SSR}$, $\bm{\Sigma}^{(1)}=\bm{\Sigma}^{(2)} = \bm{I}$, $\bm{X}^{(k)}$ has i.i.d. rows from $\mathcal{N}(0,\bm{\Sigma}^{(k)})$. The signals and design matrices are then held fixed. We generate $50$ random $\epsilon_i \sim \mathcal{N}(0,1)$ and report the average empirical risks.
Figure 2: Generalization error of pooled min-$\ell_2$-norm interpolator under covariate shift. Solid lines: theoretically predicted values. $+$ marks: empirical values. Dotted horizontal line: Risk for min-$\ell_2$-norm interpolator using only the target data. Design choices: $\beta_i^{(1)}=\beta_i^{(2)} \sim \mathcal{N}(0,\sigma_\beta^2)$ where $\sigma_\beta^2=\textnormal{SNR}$, $\bm{\Sigma}^{(2)} = \bm{I}$ and $\bm{\Sigma}^{(1)}$ has two distinct eigenvalues $\kappa$ and $1/\kappa$, $\bm{X}^{(k)}$ has i.i.d. rows from $\mathcal{N}(0,\bm{\Sigma}^{(k)})$. The signals and design matrices are then held fixed. We generate $50$ random $\epsilon_i \sim \mathcal{N}(0,1)$ and report the average empirical risks.
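The empirical side of a covariate-shift experiment like the one in the Figure 2 caption can be sketched as follows. This is an illustrative reconstruction, not the authors' simulation code. Assumptions made here: the function name `simulate_covariate_shift` is hypothetical, $\bm{\Sigma}^{(1)}$ is diagonal with half of its eigenvalues equal to $\kappa$ and the other half equal to $1/\kappa$ (the caption does not give the split), and the signal variance is scaled by $1/p$ as in the Figure 1 caption.

```python
import numpy as np


def simulate_covariate_shift(n_src, n_tgt, p, kappa, snr, n_noise=50, seed=0):
    """Average empirical target risk of pooled vs. target-only min-norm interpolators."""
    rng = np.random.default_rng(seed)

    # Assumed spectrum of the source covariance: half kappa, half 1/kappa.
    eig = np.concatenate([np.full(p // 2, kappa), np.full(p - p // 2, 1.0 / kappa)])

    # Shared signal (no model shift); the 1/p variance scaling is an assumption.
    beta = rng.normal(0.0, np.sqrt(snr / p), size=p)

    # Designs are drawn once and then held fixed across noise draws.
    X_src = rng.standard_normal((n_src, p)) * np.sqrt(eig)  # rows ~ N(0, diag(eig))
    X_tgt = rng.standard_normal((n_tgt, p))                 # rows ~ N(0, I)
    X_pool = np.vstack([X_src, X_tgt])
    pinv_pool = np.linalg.pinv(X_pool)
    pinv_tgt = np.linalg.pinv(X_tgt)

    risks_pool, risks_tgt = [], []
    for _ in range(n_noise):
        eps = rng.standard_normal(n_src + n_tgt)
        y_pool = X_pool @ beta + eps
        y_tgt = y_pool[n_src:]  # responses of the target rows only
        b_pool = pinv_pool @ y_pool
        b_tgt = pinv_tgt @ y_tgt
        # Target covariance is the identity, so the risk is a squared l2 distance.
        risks_pool.append(np.sum((b_pool - beta) ** 2))
        risks_tgt.append(np.sum((b_tgt - beta) ** 2))
    return float(np.mean(risks_pool)), float(np.mean(risks_tgt))
```

For the estimators to interpolate, `p` should exceed `n_src + n_tgt`. Sweeping `kappa` or `n_src` and plotting the first return value against the second gives the kind of pooled-versus-target comparison the caption describes; the theoretical curves are not computed by this sketch.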

Theorems & Definitions (57)

Lemma 2.1
proof
Theorem 3.1: Risk under model shift
Proposition 3.2: Corollary of Theorem 1 in Hastie et al. (2022)
Proposition 3.3: Impact of model shift
proof
Corollary 3.4
proof
Proposition 3.5
Definition 4.1: Simultaneously diagonalizable
...and 47 more
