Understanding Optimal Feature Transfer via a Fine-Grained Bias-Variance Analysis

Yufan Li; Subhabrata Sen; Ben Adlam

TL;DR

This work develops a tractable linear transfer-learning model to study how upstream pretrained features influence downstream regression performance. By deriving exact asymptotics for the downstream risk $R^{\mathrm{avg}}$ and a fine-grained bias-variance decomposition, the authors show that the optimal pretrained representation $\widehat{\mathbf{B}}$ is often sparse and undergoes a phase transition between hard and soft feature selection as the effective rank changes. An end-to-end predictor (EEP) is proposed to minimize the average risk across downstream tasks, and a minimax variant controls worst-case performance; empirically, the EEP outperforms baselines by balancing bias and variance across regimes. The analysis connects to principal component regression (PCR) in the spectrum-only case and shows that the optimal featurization adapts to both the data covariance and the task prior, with sparse, structured solutions emerging without explicit sparsity priors. These results offer practical guidance for pretraining strategies and a rigorous account of when and how sparsity and spectral alignment help transfer learning.

Abstract

In the transfer learning paradigm, models learn useful representations (or features) during a data-rich pretraining stage and then use the pretrained representation to improve performance on data-scarce downstream tasks. In this work, we explore transfer learning with the goal of optimizing downstream performance. We introduce a simple linear model that takes as input an arbitrary pretrained feature transform. We derive exact asymptotics of the downstream risk and its fine-grained bias-variance decomposition. We then identify the pretrained representation that optimizes the asymptotic downstream bias and variance averaged over an ensemble of downstream tasks. Our theoretical and empirical analysis uncovers the surprising phenomenon that the optimal featurization is naturally sparse, even in the absence of explicit sparsity-inducing priors or penalties. Additionally, we identify a phase transition where the optimal pretrained representation shifts from hard selection to soft selection of relevant features.
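
To make the setup in the abstract concrete, the following is a minimal sketch of a linear downstream predictor that takes a pretrained feature transform as input. It is a simplified illustration and not the paper's exact predictor: it regresses only on the features $\mathbf{X}\widehat{\mathbf{B}}$ (no residual component and no regularization weights), assumes the ground truth has the form $\bm{\beta}^\star = \mathbf{B}^\star \bm{\alpha}^\star$, and uses identity data covariance. The function names `ridgeless_fit`, `pretrained_feature_fit`, and `excess_risk` are hypothetical.

```python
import numpy as np

def ridgeless_fit(X, y):
    """Minimum-norm least-squares coefficients via the Moore-Penrose pseudo-inverse."""
    return np.linalg.pinv(X) @ y

def pretrained_feature_fit(X, y, B_hat):
    """Regress y on the features X @ B_hat, then map the coefficients back to R^p.

    Simplifying assumption: the predictor lives entirely in the column span of B_hat.
    """
    alpha_hat = ridgeless_fit(X @ B_hat, y)
    return B_hat @ alpha_hat

def excess_risk(beta_hat, beta_star, Sigma):
    """Excess prediction risk (beta_hat - beta_star)^T Sigma (beta_hat - beta_star)."""
    d = beta_hat - beta_star
    return float(d @ Sigma @ d)

# Toy data-scarce downstream task: n < p, with q << n true features (assumed sizes).
rng = np.random.default_rng(0)
n, p, q = 50, 200, 10
B_star = rng.standard_normal((p, q)) / np.sqrt(p)   # assumed ground-truth features
alpha_star = rng.standard_normal(q)
beta_star = B_star @ alpha_star                     # assumed form of the true signal
Sigma = np.eye(p)                                   # identity covariance for simplicity
X = rng.standard_normal((n, p))
y = X @ beta_star + 0.1 * rng.standard_normal(n)

risk_plain = excess_risk(ridgeless_fit(X, y), beta_star, Sigma)
risk_oracle = excess_risk(pretrained_feature_fit(X, y, B_star), beta_star, Sigma)
print(f"standard ridgeless: {risk_plain:.4f}, oracle features: {risk_oracle:.4f}")
```

In this toy regime the oracle featurization $\widehat{\mathbf{B}} = \mathbf{B}^\star$ reduces the regression from $p$ to $q$ unknowns, which is the mechanism by which a good pretrained representation can help when downstream data are scarce.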

Paper Structure (37 sections, 17 theorems, 211 equations, 8 figures, 1 algorithm)

Introduction
Contributions
Preliminaries
Setting
Ridgeless Regression with Pretrained Representation
Related Work
Analytic Results for Downstream Risk
Downstream-Optimal Feature Transfer
Structures of the Optimal Representation
Aligned, Spectrum-Only Case and Connection to PCR
Fully-Optimized Representation
Proofs of Main Results
Proof of \ref{thm}
Preliminary Bounds on Asymptotic Quantities
Characterization of Fine-Grained Bias-Variance Decomposition
...and 22 more sections

Key Result

Proposition 2.2

We have … where $(\cdot)^+$ denotes the Moore-Penrose pseudo-inverse and $\hat{\bm{\Gamma}}:=\widehat{\mathbf{Q}}^\top \hat{\bm{\Lambda}} \widehat{\mathbf{Q}} \in \mathbb{R}^{p\times p}$. Here, $\hat{\bm{\Lambda}} \in \mathbb{R}^{p\times p}$ is diagonal such that for $i=1,\ldots,p$ and $\bm{\lambda}=(\ldots)$ … We will use the notation $\hat{r}_i := \hat{\bm{\Lambda}}_{ii}$ for easier exposition.

Figures (8)

Figure 1: (left): Asymptotic risk $\mathbb{E}_{\mathbf{X} , {\bm{\varepsilon}} }\mathfrak{R}$ (denoted $R_{\mathrm{asy}}$) compared with the empirical mean of the simulated risk $R$ across 50 sample draws of $\mathbf{X} , {\bm{\varepsilon}}$ (${\bm{\beta} ^\star}$ fixed), for the standard ridgeless predictor (denoted $\mathsf{R}$) and the predictor with oracle featurization: $\widehat{\mathbf{B}}\gets \mathbf{B} ^\star, \lambda_\mathbf{\alpha}=\lambda=1,\lambda_\mathbf{\beta}=0$ (denoted $\mathsf{O}$). We fix $p=3000$ and vary $n$ from 8200 to 250 (x-axis on log-scale). (right): Asymptotic bias and variance $B , V$ (denoted $B_{\mathrm{asy}}, V_{\mathrm{asy}}$) of the two predictors compared with their simulated counterparts (denoted $B_{\mathrm{sim}}, V_{\mathrm{sim}}$). All plots are generated with columns of $\mathbf{B} ^\star \in \mathbb{R}^{p \times q}, q=900$ drawn independently from $N(\bm{0}, \mathbf{\Sigma}^{\mathbf{B} ^\star}), \mathbf{\Sigma}^{\mathbf{B} ^\star}_{ij}=0.5^{|i-j|}$, $\mathbf{\Sigma}\sim \frac{1}{p}\mathbf{W} \mathbf{W}^\top+0.005\cdot \mathbf{I}_p, \mathbf{W}\sim N(\bm{0}, \mathbf{I}_p \otimes \mathbf{I}_p)$ and $\bm{\alpha} ^\star \sim N(\bm{0},\mathfrak{c}\cdot \mathbf{I})$. We maintain $\sigma^2=1$ and set $\mathfrak{c}$ such that $\mathsf{SNR}:=\norm{{\bm{\beta} ^\star}}_2/\sigma=10$.
Figure 2: (a)-(c): Empirical mean of the asymptotic risk $\mathfrak{R}$ (denoted $R$) over $3000$ draws of $\bm{\alpha} ^\star \sim N(\bm{0},\mathfrak{c}\cdot \mathbf{I})$, for $\widehat{\mathbf{B}} \in \mathbb{R}^{p\times k}$ as RP, OFP, and EEP. Error bars depict the empirical mean and standard deviation of the actual risk $R$, evaluated from simulated data $(\mathbf{y} , \mathbf{X} )$ across different $\bm{\alpha} ^\star$ draws. (d)-(f): Empirical mean of bias $B$ (denoted $B$) and variance $V$ (denoted $V$) over $\bm{\alpha} ^\star$ draws. All plots are generated with columns of $\mathbf{B} ^\star \in \mathbb{R}^{p \times q}$ drawn independently from $N(\bm{0}, \mathbf{\Sigma}^{\mathbf{B} ^\star}), \mathbf{\Sigma}^{\mathbf{B} ^\star}_{ij}=0.5^{|i-j|}$ and $\mathbf{\Sigma}$ from $\frac{1}{m}\mathbf{W} \mathbf{W}^\top+0.005\cdot \mathbf{I}_p, \mathbf{W}\sim N(\bm{0}, \mathbf{I}_p \otimes \mathbf{I}_m)$, with $m$ being the approximate rank of $\mathbf{\Sigma}$. (a) and (d) fix $p=600, q=300$ and vary $n$ from $560$ to $100$. (b) and (e) vary $m$ for $q=50$. (c) and (f) vary $k$, the width of $\widehat{\mathbf{B}}$. (d) and (h) vary $q$. We maintain $\sigma^2=1$ and set $\mathfrak{c}$ such that $\mathsf{SNR}:=\norm{{\bm{\beta} ^\star}}_2/\sigma=10$. We set $n=100, p=m=k=200$ unless specified otherwise.
Figure 3: (a), (d), (g), (j): Heat map of the matrix $\mathbf{M}\in \mathbb{R}^{p\times p}, \mathbf{M}_{ij}=\hat{\mathbf{q}}_i^\top \mathbf{q} ^\star_j$ depicting the alignment between the eigenvectors $\qty{\hat{\mathbf{q}}_i}_{i=1}^p$ of $\widehat{\mathbf{B}} {\widehat{\mathbf{B}}}^\top$ and the eigenvectors $\qty{\mathbf{q}_i ^\star}_{i=1}^p$ of the ground-truth feature matrix $\mathbf{B} ^\star {\mathbf{B} ^\star}^\top$. (b), (e), (h), (k): Heat map of the matrix $\mathbf{N}\in \mathbb{R}^{p\times p}, \mathbf{N}_{ij}=\hat{\mathbf{q}}_i^\top \mathbf{u}_j$ depicting the alignment between the eigenvectors $\qty{\hat{\mathbf{q}}_i}_{i=1}^p$ of $\widehat{\mathbf{B}} {\widehat{\mathbf{B}}}^\top$ and the eigenvectors $\qty{\mathbf{u}_i}_{i=1}^p$ of the data covariance $\mathbf{\Sigma}$. (c), (f), (i), (l): Eigenvalues of $\widehat{\mathbf{B}} \widehat{\mathbf{B}}^\top$. The top row depicts the regime $q=50<n=100$ and the bottom row depicts the regime $q=150>n=100$. The left panel is for $\mathsf{SNR}=\norm{{\bm{\beta} ^\star}}_2/\sigma=25$ and the right panel is for $\mathsf{SNR}=0.5$. Throughout, we set $n=100, p=200, \sigma^2=1, \mathbf{\Sigma}_{ij}=0.5^{|i-j|}$ and draw $\mathbf{B} ^\star \sim N(\bm{0}, \mathbf{I}_p \otimes \mathbf{I}_q), \bm{\alpha} ^\star\sim N(\bm{0}, \mathfrak{c}\cdot\mathbf{I}_q)$ where $\mathfrak{c}$ is chosen to adjust $\mathsf{SNR}$ to the specified levels.
Figure 4: (a)-(f): Optimized for $\mathfrak{B}^\mathsf{avg}$ only. (g)-(l): Optimized for $\mathcal{V}$ only. Otherwise, same settings as Figure 3.
Figure 5: Optimized for $\mathfrak{B}^\mathsf{avg}$ only. Set $\mathbf{B} ^\star {\mathbf{B} ^\star}^\top =\mathbf{I}_p$; otherwise the same as Figure 3.
...and 3 more figures
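
The simulation recipe in the Figure 1 caption and the alignment matrix $\mathbf{M}$ in the Figure 3 caption can be sketched in NumPy as follows. This is a small-scale illustration under stated assumptions, not the authors' code: it assumes $\bm{\beta}^\star = \mathbf{B}^\star\bm{\alpha}^\star$, and it reads "set $\mathfrak{c}$ such that $\mathsf{SNR}=10$" as rescaling one draw of $\bm{\alpha}^\star$ so the ratio holds exactly. The helper names and the random `B_hat` standing in for a pretrained representation are hypothetical.

```python
import numpy as np

def ar1_covariance(p, rho=0.5):
    """Covariance with entries rho^{|i-j|}, as used in the captions."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def draw_figure1_setup(p, q, snr=10.0, sigma=1.0, seed=0):
    """Draw Sigma, B_star, alpha_star following the Figure 1 caption (small scale)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((p, p))
    Sigma = W @ W.T / p + 0.005 * np.eye(p)
    # Columns of B_star are independent N(0, Sigma_B) with Sigma_B = AR(1) covariance.
    chol = np.linalg.cholesky(ar1_covariance(p))
    B_star = chol @ rng.standard_normal((p, q))
    alpha = rng.standard_normal(q)
    # Assumption: rescale so that ||B_star @ alpha||_2 / sigma equals the target SNR.
    alpha *= snr * sigma / np.linalg.norm(B_star @ alpha)
    return Sigma, B_star, alpha

def sorted_eigvecs(S):
    """Eigenvectors of a symmetric matrix, ordered by decreasing eigenvalue."""
    vals, vecs = np.linalg.eigh(S)
    return vecs[:, np.argsort(vals)[::-1]]

def alignment(B_hat, B_star):
    """M[i, j] = q_hat_i^T q_star_j for eigenvectors of B_hat B_hat^T and B_star B_star^T."""
    Q_hat = sorted_eigvecs(B_hat @ B_hat.T)
    Q_star = sorted_eigvecs(B_star @ B_star.T)
    return Q_hat.T @ Q_star

Sigma, B_star, alpha_star = draw_figure1_setup(p=60, q=20)
beta_star = B_star @ alpha_star
# Hypothetical stand-in for a pretrained representation; not an optimized B_hat.
B_hat = np.random.default_rng(1).standard_normal((60, 20))
M = alignment(B_hat, B_star)
print(np.linalg.norm(beta_star), M.shape)
```

Because both eigenvector bases are orthonormal, $\mathbf{M}$ is an orthogonal matrix; a heat map of its entries concentrates on few cells when $\widehat{\mathbf{B}}$ is aligned with $\mathbf{B}^\star$ and is diffuse for an unrelated $\widehat{\mathbf{B}}$ such as the random one used here.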

Theorems & Definitions (42)

Definition 2.1: Predictor for Downstream Tasks
Proposition 2.2: Explicit Expression of the Optimizers
Remark 2.3: Monotonicity of $r(\cdot)$
Definition 3.2: Self-consistent equation
Theorem 3.3
Definition 4.1: End-to-end predictor (EEP)
Remark 4.2: Minimax Optimality
Proposition 4.3
Theorem 5.1
Remark 5.2
...and 32 more
