Heterogeneous transfer learning for high-dimensional regression with feature mismatch

Jae Ho Chang; Massimiliano Russo; Subhadeep Paul

TL;DR

Heterogeneous transfer learning (HTL) with feature mismatch addresses high-dimensional regression when the target data lack some covariates that are available in a data-rich proxy dataset. The method learns a feature map from the proxy data (linear, or nonparametric via a sieve estimator) to impute the missing target features, then performs a two-stage penalized regression using both matched and imputed features. Nonasymptotic upper bounds are derived for the estimation and prediction errors, detailing their dependence on proxy-target quality, map discrepancy, and sample sizes; the results extend to multiple proxy domains. Simulations and an ovarian cancer gene-expression case study show that HTL-impute outperforms homogeneous TL and target-only approaches, with improved prediction and more reliable inference in settings with feature mismatch and data-poor targets.

Abstract

We consider Heterogeneous Transfer Learning (HTL) from a source to a new target domain for high-dimensional regression with differing feature sets. Most homogeneous TL methods assume that target and source domains share the same feature space, which limits their practical applicability. In applications, the target and source features frequently differ because certain variables cannot be measured in data-poor target environments. Conversely, existing HTL methods do not provide statistical error guarantees, limiting their utility for scientific discovery. Our method first learns a feature map between the missing and observed features, leveraging the vast source data, and then imputes the missing features in the target. Using the combined matched and imputed features, we then perform a two-step transfer learning for penalized regression. We develop upper bounds on estimation and prediction errors, assuming that the source and target parameters differ sparsely but without assuming sparsity in the target model. We obtain results both when the feature map is linear and when it is specified nonparametrically through unknown functions. Our results elucidate how the estimation and prediction errors of HTL depend on the model's complexity, the sample sizes, the quality of and differences in the feature maps, and differences in the models across domains.
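
The pipeline described in the abstract (learn a feature map on the source, impute the missing target features, then run a two-step penalized regression) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a linear feature map fitted by lightly ridge-stabilized least squares, lasso penalties in both steps, and a plain proximal-gradient (ISTA) solver. The function names `lasso_ista` and `htl_impute_sketch`, the tuning parameters `lam_w` and `lam_delta`, and the choice to fit the first step on source data only are all hypothetical choices made for illustration.

```python
import numpy as np


def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_ista(X, y, lam, n_iter=500):
    """Approximately minimize (1/2n)||y - Xb||^2 + lam * ||b||_1 with ISTA."""
    n, p = X.shape
    # Step size 1/L, where L = sigma_max(X)^2 / n is the gradient's Lipschitz constant.
    step = n / (np.linalg.norm(X, 2) ** 2)
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b = soft_threshold(b - step * grad, step * lam)
    return b


def htl_impute_sketch(Xs_obs, Xs_mis, ys, Xt_obs, yt, lam_w, lam_delta, ridge=1e-6):
    """Hypothetical sketch: impute missing target features, then two-step lasso.

    Xs_obs, Xs_mis : source features that are observed / missing in the target
    Xt_obs         : target features (only the matched ones are available)
    """
    p_obs = Xs_obs.shape[1]
    # Feature map (assumed linear): regress the missing block on the observed
    # block using source data; the small ridge term is for numerical stability.
    A = np.linalg.solve(Xs_obs.T @ Xs_obs + ridge * np.eye(p_obs), Xs_obs.T @ Xs_mis)
    # Impute the features that the target domain could not measure.
    Xt_mis_hat = Xt_obs @ A
    Xs = np.hstack([Xs_obs, Xs_mis])
    Xt = np.hstack([Xt_obs, Xt_mis_hat])
    # Step 1 (assumption): penalized fit on the source data.
    w = lasso_ista(Xs, ys, lam_w)
    # Step 2 (assumption): sparse correction fitted to the target residuals.
    delta = lasso_ista(Xt, yt - Xt @ w, lam_delta)
    return w + delta, A
```

Calling `htl_impute_sketch(Xs_obs, Xs_mis, ys, Xt_obs, yt, lam_w=0.01, lam_delta=0.05)` returns a coefficient vector over the matched and imputed features together with the estimated map. A nonparametric map could be sketched the same way by replacing `Xs_obs` and `Xt_obs` in the map-fitting step with a basis expansion of the observed features.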
