Transfer Learning in $\ell_1$ Regularized Regression: Hyperparameter Selection Strategy based on Sharp Asymptotic Analysis

Koki Okajima, Tomoyuki Obuchi

TL;DR

This work analyzes transfer learning for high-dimensional sparse regression within a generalized two-stage Trans-Lasso framework. Using the replica method, it derives sharp asymptotic generalization errors for the first and second stages, ε^(1st) and ε^(2nd), in terms of a finite set of order parameters Θ1 and Θ2 that satisfy nonlinear equations of state, enabling principled hyperparameter selection. A key finding is that transferring either the support information or the pretrained vector alone suffices to achieve near-optimal performance, suggesting simple, robust hyperparameter strategies (Δλ = 0 or κ = 0) that often match more exhaustive LO tuning. These insights are corroborated by synthetic simulations and by experiments on real-world (IMDb) and semi-artificial (MNIST) data, showing that the hyperparameter search can be reduced with little loss in predictive accuracy. The results have direct implications for deploying transfer learning in high-dimensional sparse regression, especially when target data are scarce or noisy.

Abstract

Transfer learning techniques aim to leverage information from multiple related datasets to enhance prediction quality against a target dataset. Such methods have been adopted in the context of high-dimensional sparse regression, and some Lasso-based algorithms have been invented: Trans-Lasso and Pretraining Lasso are such examples. These algorithms require the statistician to select hyperparameters that control the extent and type of information transfer from related datasets. However, selection strategies for these hyperparameters, as well as the impact of these choices on the algorithm's performance, have been largely unexplored. To address this, we conduct a thorough, precise study of the algorithm in a high-dimensional setting via an asymptotic analysis using the replica method. Our approach reveals a surprisingly simple behavior of the algorithm: Ignoring one of the two types of information transferred to the fine-tuning stage has little effect on generalization performance, implying that efforts for hyperparameter selection can be significantly reduced. Our theoretical findings are also empirically supported by applications on real-world and semi-artificial datasets using the IMDb and MNIST datasets, respectively.
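
The two-stage procedure and its two transfer hyperparameters can be sketched in code. The objective used here is an assumption, since this page does not state it: the first stage is taken to be a Lasso fit on pooled data, and the second stage a Lasso fit on the target data in which κ scales an offset built from the first-stage vector and Δλ adds extra penalty on coordinates outside the first-stage support. The solver (plain ISTA) and all function and variable names are illustrative choices, not the authors' implementation.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding; t may be a scalar or a vector."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def weighted_lasso(X, y, penalties, n_iter=2000):
    """Minimise 0.5*||y - X w||^2 + sum_i penalties[i]*|w_i| with ISTA."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # inverse Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * penalties)
    return w

def two_stage(X_pool, y_pool, X_tgt, y_tgt, lam1, lam2, kappa, dlam):
    """Assumed two-stage scheme: pretrain on pooled data, fine-tune on target data."""
    n_feat = X_tgt.shape[1]
    # First stage: ordinary Lasso on the pooled data.
    w1 = weighted_lasso(X_pool, y_pool, np.full(n_feat, lam1))
    # Support transfer: extra penalty dlam on coordinates the first stage set to zero.
    penalties = lam2 + dlam * (w1 == 0)
    # Vector transfer: fit a correction to the residual left by kappa * w1.
    delta = weighted_lasso(X_tgt, y_tgt - kappa * (X_tgt @ w1), penalties)
    return w1, kappa * w1 + delta

# Synthetic illustration (all values are arbitrary assumptions).
rng = np.random.default_rng(0)
n_feat = 50
w_common = np.zeros(n_feat)
w_common[:5] = 1.0
w_target = w_common.copy()
w_target[5] = 0.5  # the target differs slightly from the shared signal
X_src = rng.standard_normal((80, n_feat))
y_src = X_src @ w_common + 0.1 * rng.standard_normal(80)
X_tgt = rng.standard_normal((20, n_feat))
y_tgt = X_tgt @ w_target + 0.1 * rng.standard_normal(20)
X_pool = np.vstack([X_src, X_tgt])
y_pool = np.concatenate([y_src, y_tgt])

for name, kappa, dlam in [("dlam = 0", 1.0, 0.0), ("kappa = 0", 0.0, 2.0)]:
    _, w2 = two_stage(X_pool, y_pool, X_tgt, y_tgt, 5.0, 1.0, kappa, dlam)
    print(name, "squared estimation error:", float(np.sum((w2 - w_target) ** 2)))
```

Under this assumed parameterization, setting `dlam = 0` transfers only the pretrained vector and setting `kappa = 0` transfers only the support, so each of the two simple strategies leaves a single transfer hyperparameter to tune alongside the regularization strengths.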

Paper Structure (21 sections, 44 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Comparison of the generalization error obtained from Claim 2 (solid line) with finite-size simulations (markers). Error bars represent the standard error obtained from $64$ realizations of the data.
  • Figure 2: Generalization error and optimal hyperparameters for each strategy as a function of $\sigma$ for $(\alpha_1, \alpha_2) = (0.2, 0.8), (0.3, 0.8), (0.4, 0.8)$. The region shaded in red indicates the range of $\sigma$ where the $\Delta \lambda = 0$ strategy outperforms the $\kappa = 0$ strategy, while the region shaded in green indicates the opposite.
  • Figure 3: Ratios $\min \{ \epsilon_{\Delta \lambda = 0 }, \epsilon_{\kappa = 0} \} / \epsilon_{\rm LO}$, $\epsilon_{\rm Pretrain} / \epsilon_{\rm LO}$, and $\epsilon_{\rm Trans} / \epsilon_{\rm LO}$ under the setting $(\pi^{(0)}, \pi^{(1)}, \pi^{(2)}) = (0.1, 0.09, 0.09)$ and noise level $\sigma = 0.01$ (top figure) and $0.1$ (bottom figure) for various values of $(\alpha^{(1)} , \alpha^{(2)})$. Note that each heatmap has its own color scale bar; see Appendix \ref{appendix:additional} for a direct comparison, with shared color scale bars, of the $\kappa = 0$ or $\Delta \lambda = 0$ strategy with Pretraining Lasso and Trans-Lasso.
  • Figure 4: Second-stage test error for each IMDb genre plotted against varying values of the hyperparameters $\kappa$ and $\Delta \lambda / \lambda_2$. The plots demonstrate that the effect of $\Delta \lambda$ on generalization performance is tiny across genres.
  • Figure 5: Second-stage test error for each MNIST image for SNR = 20 (top), 5 (middle), and 2 (bottom), plotted against varying values of the hyperparameters $\kappa$ and $\Delta \lambda / \lambda_2$. White markers are placed at the minimizer of the test error for ease of visualization. The plots demonstrate that the effect of $\Delta \lambda$ on generalization performance is nontrivial.
  • ...and 5 more figures

Theorems & Definitions (5)

  • Definition 1: Equations of state for the First Stage
  • Definition 2: Equations of state of the Second Stage
  • Claim 1: Generalization error of the first stage
  • Claim 2: Generalization error of the second stage
  • Claim 3: Expected inner product between the regressors