Source-Optimal Training is Transfer-Suboptimal

C. Evans Hedges

TL;DR

The paper shows that training a source model for its own task is generically suboptimal for downstream transfer: the transfer-optimal source regularization $τ_0^*$ differs from the source-optimal $τ_S^*$, and the direction of the gap depends on task alignment $ρ$. Using L2-SP ridge regression, it derives finite-sample and deterministic-equivalent conditions for when transfer improves risk, identifies a unique transfer-optimal source penalty, and reveals a phase transition driven by $ρ$: stronger source regularization helps when $0<ρ<1$, and weaker source regularization helps when $ρ>1$. A synthetic ridge experiment verifies the linear predictions, and CIFAR-10 experiments provide evidence that the mismatch can persist in nonlinear networks. The findings imply that pretraining and regularization should account for downstream transfer objectives rather than optimizing solely for source-task performance, with practical guidance depending on task alignment and source data quality.

Abstract

We prove that training a source model optimally for its own task is generically suboptimal when the objective is downstream transfer. We study the source-side optimization problem in L2-SP ridge regression and show a fundamental mismatch between the source-optimal and transfer-optimal source regularization: outside of a measure-zero set, $τ_0^* \neq τ_S^*$. We characterize the transfer-optimal source penalty $τ_0^*$ as a function of task alignment and identify an alignment-dependent reversal: with imperfect alignment ($0<ρ<1$), transfer benefits from stronger source regularization, while in super-aligned regimes ($ρ>1$), transfer benefits from weaker regularization. In isotropic settings, the decision of whether transfer helps is independent of the target sample size and noise, depending only on task alignment and source characteristics. We verify the linear predictions in a synthetic ridge regression experiment, and we present CIFAR-10 experiments as evidence that the source-optimal versus transfer-optimal mismatch can persist in nonlinear networks.
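The mismatch described in the abstract can be explored numerically. The following is a minimal sketch, not the paper's experimental code. It assumes isotropic Gaussian covariates, a target vector that is a scaled copy of the source vector (`beta_t = rho * beta_s`, one possible reading of the alignment parameter), and a fixed target-side penalty `lam_target`. The function names and all default values are hypothetical choices made for illustration.

```python
import numpy as np


def ridge_toward(X, y, penalty, center=None):
    """Minimize ||y - X b||^2 + penalty * ||b - center||^2 in closed form."""
    d = X.shape[1]
    if center is None:
        center = np.zeros(d)  # plain ridge shrinks toward the origin
    return np.linalg.solve(X.T @ X + penalty * np.eye(d),
                           X.T @ y + penalty * center)


def sweep_source_penalty(tau0_grid, rho, d=20, n_src=40, n_tgt=15,
                         noise=1.0, lam_target=5.0, n_trials=50, seed=0):
    """Average source and transfer parameter error for each source penalty."""
    rng = np.random.default_rng(seed)
    src_risk = np.zeros(len(tau0_grid))
    tl_risk = np.zeros(len(tau0_grid))
    for _ in range(n_trials):
        beta_s = rng.standard_normal(d)
        beta_t = rho * beta_s  # ASSUMPTION: target is a scaled copy of source
        Xs = rng.standard_normal((n_src, d))
        ys = Xs @ beta_s + noise * rng.standard_normal(n_src)
        Xt = rng.standard_normal((n_tgt, d))
        yt = Xt @ beta_t + noise * rng.standard_normal(n_tgt)
        for i, tau0 in enumerate(tau0_grid):
            b0 = ridge_toward(Xs, ys, tau0)                    # source fit
            bt = ridge_toward(Xt, yt, lam_target, center=b0)   # L2-SP target fit
            # With identity covariance, excess risk equals squared parameter error.
            src_risk[i] += np.sum((b0 - beta_s) ** 2)
            tl_risk[i] += np.sum((bt - beta_t) ** 2)
    return src_risk / n_trials, tl_risk / n_trials


if __name__ == "__main__":
    grid = np.logspace(-2, 3, 15)
    src, tl = sweep_source_penalty(grid, rho=0.5)
    print("source-optimal tau0:  ", grid[np.argmin(src)])
    print("transfer-optimal tau0:", grid[np.argmin(tl)])
```

Rerunning the sweep with `rho` above and below 1 gives a rough way to see whether the two minimizers move in the directions the abstract describes. The outcome under this toy setup depends on the assumed defaults and is not a substitute for the paper's experiments.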

Paper Structure

This paper contains 20 sections, 7 theorems, 71 equations, 2 figures.

Key Result

Lemma 3.2

The expected risk of the transfer estimator decomposes into three terms: pure bias, variance induced by the $\beta_0$ prior, and variance induced by estimation error.
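For orientation, here is why a three-way split of this kind arises for a generic L2-SP ridge estimator. This is a sketch in assumed notation, not the lemma's own statement: $X, y$ are the target data with $y = X\beta_T + \varepsilon$, $\lambda$ is the target-side penalty, and $\hat\beta_0$ is the source estimate used as the shrinkage center.

$$
\hat\beta = \arg\min_{\beta}\; \|y - X\beta\|^2 + \lambda\|\beta - \hat\beta_0\|^2
= (X^\top X + \lambda I)^{-1}\left(X^\top y + \lambda \hat\beta_0\right)
$$

$$
\hat\beta - \beta_T = (X^\top X + \lambda I)^{-1}\left[\lambda\left(\hat\beta_0 - \beta_T\right) + X^\top \varepsilon\right]
$$

Writing $\hat\beta_0 - \beta_T = (\mathbb{E}\hat\beta_0 - \beta_T) + (\hat\beta_0 - \mathbb{E}\hat\beta_0)$ separates the error into a deterministic part, a part that fluctuates with the source estimate, and a part driven by the target noise $\varepsilon$. How these pieces map onto the lemma's named terms is not specified in this summary.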

Figures (2)

  • Figure 1: Synthetic validation of the alignment-dependent phase transition. The y-axis shows the ratio of transfer-optimal to source-optimal regularization ($\lambda_{TL}^*/\lambda_S^*$). A ratio $>1$ indicates over-regularization is beneficial (Standard Regime), while $<1$ indicates under-regularization is optimal (Super-Aligned Regime). The shaded region represents the 95% confidence interval over 10 seeds.
  • Figure 2: CIFAR-10 quality-shift proxy with identical labels: source training uses corrupted inputs while target fine-tuning uses clean inputs. We sweep the source weight decay and plot mean $\pm$ one standard deviation over seeds for (i) source accuracy on corrupted validation data and (ii) transfer accuracy on the clean CIFAR-10 test set after fine-tuning on a small clean target subset. Transfer-optimal performance occurs at stronger source regularization than source-optimal performance in this nonlinear network.

Theorems & Definitions (15)

  • Lemma 3.2
  • Theorem 3.3
  • Corollary 3.4
  • Theorem 3.6
  • Corollary 3.7
  • Theorem 3.8
  • Corollary 3.9
  • proof
  • proof
  • proof
  • ...and 5 more