Understanding Optimal Feature Transfer via a Fine-Grained Bias-Variance Analysis
Yufan Li, Subhabrata Sen, Ben Adlam
TL;DR
This work develops a tractable linear-transfer-learning model to study how upstream pretrained features influence downstream regression performance. By deriving exact asymptotics for the downstream risk $R^{\mathrm{avg}}$ and a fine-grained bias-variance decomposition, the authors reveal that the optimal pretrained representation $\widehat{\mathbf{B}}$ is often sparse and undergoes a phase transition between hard and soft feature selection as the effective rank changes. An end-to-end predictor (EEP) is proposed to minimize average risk across downstream tasks, and a minimax variant controls worst-case performance; empirically, the EEP outperforms baselines by balancing bias and variance across regimes. The analysis connects to principal component regression (PCR) in the spectrum-only case and shows that optimal featurization adapts to both data covariances and task priors, with sparse, structured solutions emerging without explicit sparsity priors. These results offer practical insights for pretraining strategies and provide a rigorous lens on when and how sparsity and spectral alignment help transfer learning.
Abstract
In the transfer learning paradigm, models learn useful representations (or features) during a data-rich pretraining stage, and then use the pretrained representation to improve model performance on data-scarce downstream tasks. In this work, we explore transfer learning with the goal of optimizing downstream performance. We introduce a simple linear model that takes as input an arbitrary pretrained feature transform. We derive exact asymptotics of the downstream risk and its fine-grained bias-variance decomposition. We then identify the pretrained representation that optimizes the asymptotic downstream bias and variance averaged over an ensemble of downstream tasks. Our theoretical and empirical analysis uncovers the surprising phenomenon that the optimal featurization is naturally sparse, even in the absence of explicit sparsity-inducing priors or penalties. Additionally, we identify a phase transition where the optimal pretrained representation shifts from hard selection to soft selection of relevant features.
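As a rough illustration of the kind of setup the abstract describes, the sketch below fits ridge regression on linear features $z = B^\top x$ and estimates the bias and variance of the downstream excess risk by Monte Carlo over training sets. It is not the authors' model, estimator, or asymptotic formula: the function name `downstream_bias_variance`, the Gaussian data model, the ridge penalty, and the three example transforms (identity, a "hard" column subset, a "soft" diagonal reweighting) are all assumptions chosen for illustration.

```python
import numpy as np

def downstream_bias_variance(B, Sigma, beta, n=30, noise=0.5,
                             ridge=1e-2, trials=200, seed=0):
    """Monte Carlo bias^2 / variance / excess risk of ridge regression
    on features z = B^T x (hypothetical helper, illustrative only)."""
    rng = np.random.default_rng(seed)
    p = Sigma.shape[0]
    L = np.linalg.cholesky(Sigma)
    estimates = np.empty((trials, p))
    for t in range(trials):
        # Draw a small downstream training set with x_i ~ N(0, Sigma).
        X = rng.standard_normal((n, p)) @ L.T
        y = X @ beta + noise * rng.standard_normal(n)
        Z = X @ B                       # apply the fixed feature transform
        k = Z.shape[1]
        w = np.linalg.solve(Z.T @ Z + n * ridge * np.eye(k), Z.T @ y)
        estimates[t] = B @ w            # implied coefficients in input space
    mean_est = estimates.mean(axis=0)
    d = mean_est - beta
    bias2 = d @ Sigma @ d               # squared bias in the Sigma-norm
    dev = estimates - mean_est
    variance = np.mean(np.einsum("ti,ij,tj->t", dev, Sigma, dev))
    err = estimates - beta
    risk = np.mean(np.einsum("ti,ij,tj->t", err, Sigma, err))
    return bias2, variance, risk

# Assumed toy problem: decaying spectrum, signal on the first 5 coordinates.
p = 20
Sigma = np.diag(1.0 / np.arange(1, p + 1))
beta = np.zeros(p)
beta[:5] = 1.0

transforms = {
    "identity": np.eye(p),                              # no feature selection
    "hard": np.eye(p)[:, :5],                           # keep 5 directions, drop the rest
    "soft": np.diag(1.0 / (1.0 + np.arange(p))),        # smoothly down-weight directions
}
for name, B in transforms.items():
    b2, v, r = downstream_bias_variance(B, Sigma, beta)
    print(f"{name:8s} bias^2={b2:.4f} variance={v:.4f} excess risk={r:.4f}")
```

In this sketch the empirical excess risk splits exactly into the squared-bias and variance terms, so comparing the three rows shows how a transform trades one against the other for a given sample size and noise level. The specific numbers depend entirely on the assumed toy problem and should not be read as results from the paper.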
