Transfer Learning of Linear Regression with Multiple Pretrained Models: Benefiting from More Pretrained Models via Overparameterization Debiasing

Daniel Boharon, Yehuda Dar

TL;DR

The results elucidate when using more pretrained models can improve transfer learning. The paper also proposes a simple debiasing method, a multiplicative correction factor, that reduces the overparameterization bias and makes it possible to leverage more pretrained models when learning a target predictor.

Abstract

We study transfer learning for a linear regression task using several least-squares pretrained models that can be overparameterized. We formulate the target learning task as an optimization problem that minimizes the squared error on the target dataset with a penalty on the distance of the learned model from the pretrained models. We analytically formulate the test error of the learned target model and provide the corresponding empirical evaluations. Our results elucidate when using more pretrained models can improve transfer learning. Specifically, if the pretrained models are overparameterized, using sufficiently many of them is important for beneficial transfer learning. However, the learning may be compromised by the overparameterization bias of the pretrained models, i.e., the restriction of the minimum $\ell_2$-norm solution to a small subspace spanned by the training examples in the high-dimensional parameter space. We propose a simple debiasing via a multiplicative correction factor that can reduce the overparameterization bias and leverage more pretrained models to learn a target predictor.
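The penalized formulation and the overparameterization bias described in the abstract can be illustrated with a small numerical sketch. Everything specific here is an assumption made for illustration, not the paper's exact setup: the objective $\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_2^2 + \alpha\sum_j \|\rho_j\boldsymbol{\beta}-\widehat{\boldsymbol{\theta}}_j\|_2^2$ (a scalar task relation $\rho_j\mathbf{I}_d$), isotropic Gaussian features, identical source and target parameters, and the choice $\rho_j = n_{\mathrm{src}}/d$, which is the expected shrinkage of a minimum-norm solution under isotropic Gaussian features. The function names `min_norm_fit` and `transfer_fit` are hypothetical.

```python
import numpy as np

def min_norm_fit(X, y):
    """Minimum l2-norm least-squares solution (interpolates when n < d)."""
    return np.linalg.pinv(X) @ y

def transfer_fit(X, y, thetas, rhos, alpha):
    """Closed-form minimizer of
    ||y - X b||^2 + alpha * sum_j ||rho_j * b - theta_j||^2  (assumed objective)."""
    d = X.shape[1]
    rhos = np.asarray(rhos, dtype=float)
    A = X.T @ X + alpha * np.sum(rhos ** 2) * np.eye(d)
    rhs = X.T @ y + alpha * sum(r * t for r, t in zip(rhos, thetas))
    return np.linalg.solve(A, rhs)

rng = np.random.default_rng(0)
d, n_src, n_tgt, m = 200, 50, 20, 40          # d > n_src: overparameterized sources
beta = rng.standard_normal(d) / np.sqrt(d)    # true parameters shared by all tasks

# Pretrain m overparameterized source models with minimum-norm least squares.
thetas = []
for _ in range(m):
    Xs = rng.standard_normal((n_src, d))
    ys = Xs @ beta + 0.1 * rng.standard_normal(n_src)
    thetas.append(min_norm_fit(Xs, ys))

# Overparameterization bias: each model lives in an n_src-dimensional subspace,
# so the average model is shrunk toward zero by roughly n_src / d.
avg = np.mean(thetas, axis=0)
shrink = (avg @ beta) / (beta @ beta)

# Small target dataset.
X = rng.standard_normal((n_tgt, d))
y = X @ beta + 0.1 * rng.standard_normal(n_tgt)

alpha = 10.0
b_plain = transfer_fit(X, y, thetas, np.ones(m), alpha)                # rho_j = 1
b_debiased = transfer_fit(X, y, thetas, np.full(m, n_src / d), alpha)  # rho_j = n_src/d

err_plain = np.sum((b_plain - beta) ** 2)
err_debiased = np.sum((b_debiased - beta) ** 2)
print(f"shrinkage ~ {shrink:.3f} (n_src/d = {n_src / d:.3f})")
print(f"parameter error without debiasing: {err_plain:.3f}")
print(f"parameter error with debiasing:    {err_debiased:.3f}")
```

In this toy setting the multiplicative factor rescales the pretrained models before they pull on the target estimate, which is one way to read the debiasing idea. The correction factor analyzed in the paper may differ from the simple $n_{\mathrm{src}}/d$ used here.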

Paper Structure (63 sections, 18 theorems, 173 equations, 36 figures, 2 algorithms)

Key Result

Theorem 4.4

Under the assumptions labeled "H sum is full rank" through "Asymptotic settings", the expected test error of the closed-form solution $\widehat{\boldsymbol{\beta}}_{\mathrm{TL}}$ (eq:Closed-form) is given by an expression (omitted in this summary) in which $\mathbf{W} \triangleq \mathbf{R}^{-1} \mathbf{\Sigma}_{\mathbf{x}} \mathbf{R}^{-1}$, $\mathbf{\Omega} \triangleq c(\alpha_{\mathrm{TL}}) \mathbf{W} + \alpha_{\mathrm{TL}} \mathbf{I}_d$, $c(\alpha_{\mathrm{TL}})$ …

Figures (36)

  • Figure 1: Test error in the general case of Theorem \ref{theorem:expected error for general case}. Here, $\mathbf{H}_j$ corresponds to an energy-preserving subspace of dimension $\frac{d}{2}$ (see Appendix \ref{app: Task relation}), and the assumed task relation is $\widetilde{\mathbf{H}}_j=\mathbf{I}_d$. The target covariance is $\mathbf{\Sigma}_{\mathbf{x}} = \mathbf{I}_d$ (left subfig.) and exponentially decaying, $(\mathbf{\Sigma}_{\mathbf{x}})_{il} = 0.5^{|i-l|}$ (right subfig.). See Appendix \ref{app:fig:general} for more experiments.
  • Figure 2: Test error in the simple case of Theorem \ref{theorem:optimally tuned transfer learning error - asymptotic}. Here, $\mathbf{H}_j=\widetilde{\mathbf{H}}_j=\mathbf{I}_d$. See Appendix \ref{app:fig:simple} for more experiments.
  • Figure 3: Test error under debiasing. In Fig. \ref{fig:debias1}, $\mathbf{H}_j=\mathbf{I}_d$, and in Fig. \ref{fig:debias2}, $\mathbf{H}_j$ is a subspace projection of dimension $\frac{3}{4}$ (for more details see Appendix \ref{app: Task relation}). The assumed task relation is $\widetilde{\mathbf{H}}_j=\rho_j\mathbf{I}_d$. The target covariance is $\mathbf{\Sigma}_{\mathbf{x}} = \mathbf{I}_d$ in both figures. For more figures see Appendix \ref{app:fig:debias}.
  • Figure 4: Bias-variance decomposition of the test error. Dashed and dotted lines denote the bias and variance terms, respectively. The setting without debiasing uses $\widetilde{\mathbf{H}}_j =\mathbf{I}_d$; the setting with debiasing uses $\widetilde{\mathbf{H}}_j = \rho_j\mathbf{I}_d$. See Appendix \ref{subsec:debiasing general case bias variance} for a detailed discussion and more results.
  • Figure 5: Difference in target test error between transfer learning with and without debiasing. Negative values denote beneficial debiasing. Here, $\mathbf{H}_j=\mathbf{I}_d$; see Fig. \ref{app:fig:debias_diff} for other task relations.
  • ...and 31 more figures

Theorems & Definitions (18)

  • Theorem 4.4
  • Corollary 5.1
  • Theorem 5.2
  • Theorem 5.3
  • Theorem 5.4
  • Lemma 5.5
  • Theorem 5.6
  • Theorem 5.7
  • Theorem 5.8
  • Corollary 5.9
  • ...and 8 more