The Common Intuition to Transfer Learning Can Win or Lose: Case Studies for Linear Regression

Yehuda Dar, Daniel LeJeune, Richard G. Baraniuk

TL;DR

This paper analyzes transfer learning between two linear regression tasks in highly overparameterized regimes. It introduces an intuitive TL objective that regularizes the distance between target parameters and transferred source parameters, and it derives exact and asymptotic generalization expressions under orthonormal task-relations, showing TL can resolve the double-descent peak and often outperform optimally tuned ridge regression when source and target are sufficiently related. The authors also reveal that ignoring the true task relation (e.g., using $\widetilde{\mathbf{H}}=\mathbf{I}_d$) can improve generalization in some settings due to conditioning effects, and they formulate a linear MMSE (LMMSE) transfer-learning estimator that universally improves over the intuitive approach. The work further extends to misspecified models and general task-relations, providing rigorous results and highlighting the practical value of transfer learning as a regularizer and of LMMSE as a principled optimal linear strategy.

Abstract

We study a fundamental transfer learning process from source to target linear regression tasks, including overparameterized settings where there are more learned parameters than data samples. The target task learning is addressed by using its training data together with the parameters previously computed for the source task. We define a transfer learning approach to the target task as a linear regression optimization with a regularization on the distance between the to-be-learned target parameters and the already-learned source parameters. We analytically characterize the generalization performance of our transfer learning approach and demonstrate its ability to resolve the peak in generalization errors in double descent phenomena of the minimum L2-norm solution to linear regression. Moreover, we show that for sufficiently related tasks, the optimally tuned transfer learning approach can outperform the optimally tuned ridge regression method, even when the true parameter vector conforms to an isotropic Gaussian prior distribution. Namely, we demonstrate that transfer learning can beat the minimum mean square error (MMSE) solution of the independent target task. Our results emphasize the ability of transfer learning to extend the solution space to the target task and, by that, to have an improved MMSE solution. We formulate the linear MMSE solution to our transfer learning setting and point out its key differences from the common design philosophy to transfer learning.
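
The regularized objective described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the penalty takes the form $\lambda\lVert\widetilde{\mathbf{H}}\mathbf{b}-\widehat{\boldsymbol{\theta}}\rVert_2^2$ with a generic weight `lam`, that the source parameters are learned as a minimum $\ell_2$-norm solution, and that the tasks are related by the identity plus a small perturbation. The function names, noise levels, and dimensions are hypothetical choices made for the illustration.

```python
import numpy as np


def min_norm_solution(X, y):
    """Minimum l2-norm least-squares solution, computed via the pseudoinverse."""
    return np.linalg.pinv(X) @ y


def transfer_ridge(X, y, theta_hat, lam, H_tilde=None):
    """Closed-form minimizer of ||y - X b||^2 + lam * ||H_tilde b - theta_hat||^2.

    H_tilde is the task relation assumed by the learner; it defaults to the
    identity, i.e. the target parameters are pulled directly toward theta_hat.
    """
    d = X.shape[1]
    if H_tilde is None:
        H_tilde = np.eye(d)
    # Normal equations of the regularized objective.
    A = X.T @ X + lam * (H_tilde.T @ H_tilde)
    rhs = X.T @ y + lam * (H_tilde.T @ theta_hat)
    return np.linalg.solve(A, rhs)


# Illustrative (assumed) setup: overparameterized target task, d > n.
rng = np.random.default_rng(0)
d, n, n_src = 96, 64, 128

beta = rng.normal(size=d) / np.sqrt(d)                # target parameters
theta = beta + 0.1 * rng.normal(size=d) / np.sqrt(d)  # related source parameters

Z = rng.normal(size=(n_src, d))                       # source features
z = Z @ theta + 0.1 * rng.normal(size=n_src)          # source labels
X = rng.normal(size=(n, d))                           # target features
y = X @ beta + 0.1 * rng.normal(size=n)               # target labels

theta_hat = min_norm_solution(Z, z)                   # learned source parameters
beta_tl = transfer_ridge(X, y, theta_hat, lam=10.0)   # transfer learning estimate
beta_ml2n = min_norm_solution(X, y)                   # target-only baseline

print("squared parameter error, transfer learning:", np.sum((beta_tl - beta) ** 2))
print("squared parameter error, min-norm baseline:", np.sum((beta_ml2n - beta) ** 2))
```

With isotropic features, the squared parameter error differs from the test error only by the noise variance, so the two printed values give a rough sense of how the estimators compare in this toy setup. The weight `lam` is fixed arbitrarily here, whereas the paper's comparisons use optimally tuned regularization.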

Paper Structure

This paper contains 37 sections, 8 theorems, 77 equations, 9 figures.

Key Result

Lemma 4.1

Under Assumptions \ref{assumption: H is full rank}-\ref{assumption:isotropic prior on beta - well specified}, and for $\widetilde{\mathbf{H}}=\mathbf{H}=\mathbf{\Psi}^{T}$ where […] where $\lambda_{k}\{\mathbf{X}_{\mathbf{\Psi}}^{T}\mathbf{X}_{\mathbf{\Psi}}\}$ is […]

Figures (9)

  • Figure 1: The test error of the target task under an isotropic Gaussian assumption on $\boldsymbol{\beta}$ and isotropic target features. Here $\widetilde{\mathbf{H}}=\mathbf{H}=\mathbf{\Psi}^T$, where $\mathbf{\Psi}$ is the orthonormal DCT matrix. Analytical results are shown as solid lines: red curves correspond to minimum $\ell_2$-norm (ML2N) solutions of the target task, green curves correspond to optimally tuned ridge regression, and blue curves correspond to optimally tuned transfer learning (TL) in its intuitive form from Section \ref{sec:Well-specified Feature Selection}. The corresponding empirical results (errors averaged over 150 experiments) are denoted by markers in the matching colors. The number of data samples is $n=64$ for the target task and $\widetilde{n}=128$ for the source task. The misspecified models in (b) correspond to Assumptions \ref{assumption:misspecification}-\ref{assumption:Independent misspecification with isotropic features} and polynomial reduction with $a=2.5$, $q=500$, $\rho=2$.
  • Figure 1: The test error of the target task under an isotropic Gaussian assumption on $\boldsymbol{\beta}$ and isotropic target features in a well-specified setting. The matrix $\mathbf{H}$ is a $d\times d$ circulant matrix corresponding to the discrete version of the continuous-domain convolution kernel ${h_{\rm ker}(\tau)=\delta(\tau)+ e^{-\frac{\lvert\tau-0.5\rvert}{w_{\rm ker}}}}$; the kernel width is ${w_{\rm ker}=2/75}$ in (a)-(b) and ${w_{\rm ker}=2/25}$ in (c)-(d). Curve colors and markers are as in Fig. \ref{fig:error_curves_diagrams_isotropic_H_is_orthonormal}. The number of data samples is $n=64$ for the target task and $\widetilde{n}=128$ for the source task.
  • Figure 1: The test error of the target task under an isotropic Gaussian assumption on $\boldsymbol{\beta}$ and isotropic target features. The matrix $\mathbf{H}$ is a $d\times d$ circulant matrix corresponding to the discrete version of the continuous-domain convolution kernel ${h_{\rm ker}(\tau)=\delta(\tau)+ e^{-\frac{\lvert\tau-0.5\rvert}{w_{\rm ker}}}}$; the kernel width is ${w_{\rm ker}=2/75}$. Both subfigures correspond to misspecified models according to Assumptions \ref{assumption:misspecification}-\ref{assumption:Independent misspecification with isotropic features} and polynomial reduction with $a=2.5$, $q=500$, $\rho=25$. The number of data samples is $n=64$ for the target task and $\widetilde{n}=128$ for the source task.
  • Figure 2: The effect of having fewer source training samples than target training samples, i.e., $n>\widetilde{n}$. The number of data samples is $n=128$ for the target task and $\widetilde{n}=64$ for the source task. The test error of the target task under an isotropic Gaussian assumption on $\boldsymbol{\beta}$ and isotropic target features. Both subfigures refer to well-specified settings with $\widetilde{\mathbf{H}}=\mathbf{H}$. In (a), $\mathbf{H}=\mathbf{\Psi}^T$, where $\mathbf{\Psi}$ is the orthonormal DCT matrix. In (b), $\mathbf{H}$ is a $d\times d$ circulant matrix corresponding to the discrete version of the continuous-domain convolution kernel ${h_{\rm ker}(\tau)=\delta(\tau)+ e^{-\frac{\lvert\tau-0.5\rvert}{w_{\rm ker}}}}$ with ${w_{\rm ker}=2/75}$. The non-orthonormal $\mathbf{H}$ is defined in more detail in Section \ref{subsec:Analysis for H of a General Form - Analysis of the Intuitive Transfer Learning Approach}.
  • Figure 2: The test error of the target task under an isotropic Gaussian assumption on $\boldsymbol{\beta}$ and isotropic target features in a misspecified setting (according to Assumptions \ref{assumption:misspecification}-\ref{assumption:Independent misspecification with isotropic features} and polynomial reduction with $a=2.5$, $q=500$, $\rho=2$). The matrix $\mathbf{H}$ is a $d\times d$ circulant matrix corresponding to the discrete version of the continuous-domain convolution kernel ${h_{\rm ker}(\tau)=\delta(\tau)+ e^{-\frac{\lvert\tau-0.5\rvert}{w_{\rm ker}}}}$; the kernel width is ${w_{\rm ker}=2/75}$ in (a)-(b) and ${w_{\rm ker}=2/25}$ in (c)-(d). The number of data samples is $n=64$ for the target task and $\widetilde{n}=128$ for the source task.
  • ...and 4 more figures
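
Several captions above use a circulant $\mathbf{H}$ derived from the kernel $h_{\rm ker}(\tau)=\delta(\tau)+e^{-\lvert\tau-0.5\rvert/w_{\rm ker}}$. The sketch below shows one way such a matrix could be built. The discretization is an assumption, since the captions do not specify it: a uniform grid $\tau_j=j/d$ on $[0,1)$, with the Dirac delta represented by a unit entry at $\tau=0$. The function name `circulant_from_kernel` is hypothetical.

```python
import numpy as np


def circulant_from_kernel(d, w_ker):
    """Build a d x d circulant matrix from samples of the convolution kernel.

    Assumed discretization (not taken from the paper): tau_j = j / d, and the
    delta term contributes 1 at tau = 0.
    """
    tau = np.arange(d) / d
    first_col = np.exp(-np.abs(tau - 0.5) / w_ker)
    first_col[0] += 1.0  # discrete stand-in for delta(tau)
    # Circulant structure: H[i, j] = first_col[(i - j) mod d].
    idx = (np.arange(d)[:, None] - np.arange(d)[None, :]) % d
    return first_col[idx]


H_narrow = circulant_from_kernel(d=128, w_ker=2 / 75)
H_wide = circulant_from_kernel(d=128, w_ker=2 / 25)
print("condition numbers:", np.linalg.cond(H_narrow), np.linalg.cond(H_wide))
```

Under this assumed discretization, the kernel width controls how far $\mathbf{H}$ departs from the identity, which is one way to explore the conditioning effects mentioned in the TL;DR.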

Theorems & Definitions (8)

  • Lemma 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Corollary 4.4
  • Proposition 6.1
  • Theorem 6.2
  • Theorem 6.3
  • Lemma E.1