Understanding Optimal Feature Transfer via a Fine-Grained Bias-Variance Analysis

Yufan Li, Subhabrata Sen, Ben Adlam

TL;DR

This work develops a tractable linear transfer-learning model to study how upstream pretrained features influence downstream regression performance. By deriving exact asymptotics for the downstream risk $R^{\mathrm{avg}}$ and a fine-grained bias-variance decomposition, the authors show that the optimal pretrained representation $\widehat{\mathbf{B}}$ is often sparse and undergoes a phase transition between hard and soft feature selection as the effective rank changes. An end-to-end predictor (EEP) is proposed to minimize the average risk across downstream tasks, and a minimax variant controls worst-case performance; empirically, the EEP outperforms baselines by balancing bias and variance across regimes. The analysis connects to principal component regression (PCR) in the spectrum-only case and shows that the optimal featurization adapts to both the data covariance and the task prior, with sparse, structured solutions emerging without explicit sparsity priors. These results offer practical insights for pretraining strategies and a rigorous account of when and how sparsity and spectral alignment help transfer learning.

Abstract

In the transfer learning paradigm, models learn useful representations (or features) during a data-rich pretraining stage, and then use the pretrained representation to improve model performance on data-scarce downstream tasks. In this work, we explore transfer learning with the goal of optimizing downstream performance. We introduce a simple linear model that takes as input an arbitrary pretrained feature transform. We derive exact asymptotics of the downstream risk and its fine-grained bias-variance decomposition. We then identify the pretrained representation that optimizes the asymptotic downstream bias and variance averaged over an ensemble of downstream tasks. Our theoretical and empirical analysis uncovers the surprising phenomenon that the optimal featurization is naturally sparse, even in the absence of explicit sparsity-inducing priors or penalties. Additionally, we identify a phase transition where the optimal pretrained representation shifts from hard selection to soft selection of relevant features.
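
To make the setup in the abstract concrete, the following is a minimal simulation sketch, not the paper's predictor. It assumes the downstream estimate is obtained by minimum-norm least squares on the transformed features $\mathbf{X}\widehat{\mathbf{B}}$, and it uses the classical bias-variance split over joint draws of the design and noise rather than the fine-grained decomposition the authors derive. The function name `simulate_bias_variance` and all parameter values are illustrative assumptions.

```python
import numpy as np


def simulate_bias_variance(B_hat, beta_star, Sigma_sqrt, n,
                           sigma=1.0, n_draws=50, seed=0):
    """Monte Carlo risk, bias and variance of least squares on X @ B_hat.

    Assumption (not from the paper): the predictor is the minimum-norm
    least-squares fit on the featurized design X @ B_hat.
    """
    rng = np.random.default_rng(seed)
    p = beta_star.shape[0]
    Sigma = Sigma_sqrt @ Sigma_sqrt.T  # population covariance of each row of X

    estimates = []
    for _ in range(n_draws):
        # Rows of X are N(0, Sigma); the noise is N(0, sigma^2).
        X = rng.standard_normal((n, p)) @ Sigma_sqrt.T
        y = X @ beta_star + sigma * rng.standard_normal(n)
        # Minimum-norm least squares in feature space, mapped back to R^p.
        alpha_hat = np.linalg.pinv(X @ B_hat) @ y
        estimates.append(B_hat @ alpha_hat)
    estimates = np.array(estimates)

    # Excess risk of an estimate b at a fresh x ~ N(0, Sigma) is
    # (b - beta_star)^T Sigma (b - beta_star).
    errors = estimates - beta_star
    risk = np.mean(np.einsum("ij,jk,ik->i", errors, Sigma, errors))

    mean_est = estimates.mean(axis=0)
    diff = mean_est - beta_star
    bias = diff @ Sigma @ diff
    centered = estimates - mean_est
    variance = np.mean(np.einsum("ij,jk,ik->i", centered, Sigma, centered))
    return risk, bias, variance


# Illustrative sizes only; the figures in the paper use much larger p and n.
rng = np.random.default_rng(1)
p, q, n = 20, 5, 10
B_star = rng.standard_normal((p, q))
beta_star = B_star @ rng.standard_normal(q)
risk, bias, variance = simulate_bias_variance(
    B_star, beta_star, np.eye(p), n=n, n_draws=30
)
print(risk, bias, variance)
```

Because the bias is measured around the empirical mean of the estimates, the simulated risk equals bias plus variance exactly, which gives a quick sanity check when swapping in a different `B_hat`.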

Paper Structure

This paper contains 37 sections, 17 theorems, 211 equations, 8 figures, 1 algorithm.

Key Result

Proposition 2.2

We have […] where $(\cdot)^+$ denotes the Moore-Penrose pseudo-inverse and $\hat{\bm{\Gamma}}:=\widehat{\mathbf{Q}}^\top \hat{\bm{\Lambda}} \widehat{\mathbf{Q}} \in \mathbb{R}^{p\times p}$. Here, $\hat{\bm{\Lambda}} \in \mathbb{R}^{p\times p}$ is diagonal such that, for $i=1,\ldots,p$, […] and $\bm{\lambda}$ […]. We will use the notation $\hat{r}_i := \hat{\bm{\Lambda}}_{ii}$ for easier exposition.

Figures (8)

  • Figure 1: (left): Asymptotic risk $\mathbb{E}_{\mathbf{X} , {\bm{\varepsilon}} }\mathfrak{R}$ (denoted $R_{\mathrm{asy}}$) compared with the empirical mean of the simulated risk $R$ across 50 sample draws of $\mathbf{X} , {\bm{\varepsilon}}$ (${\bm{\beta} ^\star}$ fixed), for the standard ridgeless predictor (denoted $\mathsf{R}$) and the predictor with oracle featurization: $\widehat{\mathbf{B}}\gets \mathbf{B} ^\star, \lambda_\mathbf{\alpha}=\lambda=1,\lambda_\mathbf{\beta}=0$ (denoted $\mathsf{O}$). We fix $p=3000$ and vary $n$ from 8200 to 250 (x-axis on log scale). (right): Asymptotic bias and variance $B , V$ (denoted $B_{\mathrm{asy}}, V_{\mathrm{asy}}$) of the two predictors compared with their simulated counterparts (denoted $B_{\mathrm{sim}}, V_{\mathrm{sim}}$). All plots are generated with columns of $\mathbf{B} ^\star \in \mathbb{R}^{p \times q}, q=900$ drawn independently from $N(\bm{0}, \mathbf{\Sigma}^{\mathbf{B} ^\star}), \mathbf{\Sigma}^{\mathbf{B} ^\star}_{ij}=0.5^{|i-j|}$, $\mathbf{\Sigma}\sim \frac{1}{p}\mathbf{W} \mathbf{W}^\top+0.005\cdot \mathbf{I}_p, \mathbf{W}\sim N(\bm{0}, \mathbf{I}_p \otimes \mathbf{I}_p)$ and $\bm{\alpha} ^\star \sim N(\bm{0},\mathfrak{c}\cdot \mathbf{I})$. We maintain $\sigma^2=1$ and set $\mathfrak{c}$ such that $\mathsf{SNR}:=\norm{{\bm{\beta} ^\star}}_2/\sigma=10$.
  • Figure 2: (a)-(c): Empirical mean of the asymptotic risk $\mathfrak{R}$ (denoted $R$) over $3000$ draws of $\bm{\alpha} ^\star \sim N(\bm{0},\mathfrak{c}\cdot \mathbf{I})$, for $\widehat{\mathbf{B}} \in \mathbb{R}^{p\times k}$ as RP, OFP, and EEP. Error bars depict the empirical mean and standard deviation of the actual risk $R$, evaluated from simulated data $(\mathbf{y} , \mathbf{X} )$ across different $\bm{\alpha} ^\star$ draws. (d)-(f): Empirical mean of bias $B$ (denoted $B$) and variance $V$ (denoted $V$) over $\bm{\alpha} ^\star$ draws. All plots are generated with columns of $\mathbf{B} ^\star \in \mathbb{R}^{p \times q}$ drawn independently from $N(\bm{0}, \mathbf{\Sigma}^{\mathbf{B} ^\star}), \mathbf{\Sigma}^{\mathbf{B} ^\star}_{ij}=0.5^{|i-j|}$ and $\mathbf{\Sigma}$ from $\frac{1}{m}\mathbf{W} \mathbf{W}^\top+0.005\cdot \mathbf{I}_p, \mathbf{W}\sim N(\bm{0}, \mathbf{I}_p \otimes \mathbf{I}_m)$, with $m$ being the approximate rank of $\mathbf{\Sigma}$. (a) and (d) fix $p=600, q=300$ and vary $n$ from $560$ to $100$. (b) and (e) vary $m$ for $q=50$. (c) and (f) vary $k$, the width of $\widehat{\mathbf{B}}$. (d) and (h) vary $q$. We maintain $\sigma^2=1$ and set $\mathfrak{c}$ such that $\mathsf{SNR}:=\norm{{\bm{\beta} ^\star}}_2/\sigma=10$. We set $n=100, p=m=k=200$ unless specified otherwise.
  • Figure 3: (a), (d), (g), (j): Heat map of the matrix $\mathbf{M}\in \mathbb{R}^{p\times p}, \mathbf{M}_{ij}=\hat{\mathbf{q}}_i^\top \mathbf{q} ^\star_j$ depicting the alignment between the eigenvectors $\qty{\hat{\mathbf{q}}_i}_{i=1}^p$ of $\widehat{\mathbf{B}} {\widehat{\mathbf{B}}}^\top$ and the eigenvectors $\qty{\mathbf{q}_i ^\star}_{i=1}^p$ of the ground-truth feature matrix $\mathbf{B} ^\star {\mathbf{B} ^\star}^\top$. (b), (e), (h), (k): Heat map of the matrix $\mathbf{N}\in \mathbb{R}^{p\times p}, \mathbf{N}_{ij}=\hat{\mathbf{q}}_i^\top \mathbf{u}_j$ depicting the alignment between the eigenvectors $\qty{\hat{\mathbf{q}}_i}_{i=1}^p$ of $\widehat{\mathbf{B}} {\widehat{\mathbf{B}}}^\top$ and the eigenvectors $\qty{\mathbf{u}_i}_{i=1}^p$ of the data covariance $\mathbf{\Sigma}$. (c), (f), (i), (l): Eigenvalues of $\widehat{\mathbf{B}} \widehat{\mathbf{B}}^\top$. The top row depicts the regime $q=50<n=100$ and the bottom row depicts the regime $q=150>n=100$. The left panel is for $\mathsf{SNR}=\norm{{\bm{\beta} ^\star}}_2/\sigma=25$ and the right panel is for $\mathsf{SNR}=0.5$. Throughout, we set $n=100, p=200, \sigma^2=1, \mathbf{\Sigma}_{ij}=0.5^{|i-j|}$ and draw $\mathbf{B} ^\star \sim N(\bm{0}, \mathbf{I}_p \otimes \mathbf{I}_q), \bm{\alpha} ^\star\sim N(\bm{0}, \mathfrak{c}\cdot\mathbf{I}_q)$ where $\mathfrak{c}$ is chosen to adjust $\mathsf{SNR}$ to the specified levels.
  • Figure 4: (a)-(f): Optimized for $\mathfrak{B}^\mathsf{avg}$ only. (g)-(l): Optimized for $\mathcal{V}$ only. Otherwise, same settings as Figure 3.
  • Figure 5: Optimized for $\mathfrak{B}^\mathsf{avg}$ only. Set $\mathbf{B} ^\star {\mathbf{B} ^\star}^\top =\mathbf{I}_p$; otherwise same settings as Figure 3.
  • ...and 3 more figures
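
The alignment matrices described in the Figure 3 caption can be computed along the following lines. This is a minimal sketch: the sizes $p=200$, $q=50$ and the covariance $\mathbf{\Sigma}_{ij}=0.5^{|i-j|}$ follow the caption, but the choice of $\widehat{\mathbf{B}}$ as a noisy copy of $\mathbf{B}^\star$ is a hypothetical stand-in for the optimized representation, and the helper names `eigvecs_desc` and `alignment_matrix` are illustrative, not from the paper.

```python
import numpy as np


def eigvecs_desc(A):
    """Eigenpairs of a symmetric matrix, sorted by decreasing eigenvalue."""
    vals, vecs = np.linalg.eigh(A)   # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]


def alignment_matrix(A_hat, A_ref):
    """Matrix with entries q_hat_i^T q_ref_j for eigenvectors of A_hat, A_ref."""
    _, Q_hat = eigvecs_desc(A_hat)
    _, Q_ref = eigvecs_desc(A_ref)
    return Q_hat.T @ Q_ref


rng = np.random.default_rng(0)
p, q = 200, 50

# Ground-truth features with i.i.d. standard normal entries, as in the caption.
B_star = rng.standard_normal((p, q))
# Hypothetical pretrained features: a noisy copy of B_star (assumption).
B_hat = B_star + 0.1 * rng.standard_normal((p, q))

# Data covariance with entries 0.5^{|i-j|}.
idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])

M = alignment_matrix(B_hat @ B_hat.T, B_star @ B_star.T)  # M_ij = q_hat_i^T q*_j
N = alignment_matrix(B_hat @ B_hat.T, Sigma)              # N_ij = q_hat_i^T u_j
spectrum = eigvecs_desc(B_hat @ B_hat.T)[0]               # eigenvalues of B_hat B_hat^T
```

Both `M` and `N` are products of orthogonal matrices, so their rows have unit norm; a heat map of the absolute entries (for example with `matplotlib.pyplot.imshow`) then shows which reference eigenvectors each $\hat{\mathbf{q}}_i$ concentrates on.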

Theorems & Definitions (42)

  • Definition 2.1: Predictor for Downstream Tasks
  • Proposition 2.2: Explicit Expression of the Optimizers
  • Remark 2.3: Monotonicity of $r(\cdot)$
  • Definition 3.2: Self-consistent equation
  • Theorem 3.3
  • Definition 4.1: End-to-end predictor (EEP)
  • Remark 4.2: Minimax Optimality
  • Proposition 4.3
  • Theorem 5.1
  • Remark 5.2
  • ...and 32 more