Post-Transfer Learning Statistical Inference in High-Dimensional Regression

Nguyen Vu Khai Tam, Cao Huyen My, Vo Nguyen Le Duy

TL;DR

This paper addresses statistical inference after transfer learning in high-dimensional regression (TL-HDR), where standard $p$-values are invalid due to data-dependent feature selection. It introduces PTL-SI, a selective-inference framework tailored to the TransFusion TL-HDR method, providing valid $p$-values that control the false positive rate at a chosen level $\alpha$ while boosting power via a divide-and-conquer strategy for identifying the truncation region. The authors prove the validity of the selective $p$-values and demonstrate the method on synthetic and real-world datasets, including an extension to Oracle Trans-Lasso. The work enables reliable significance testing of transferred features in HDR contexts, improving interpretability and trust in TL-HDR analyses.

Abstract

Transfer learning (TL) for high-dimensional regression (HDR) is an important problem in machine learning, particularly when the target task has a limited sample size. However, there is currently no method to quantify the statistical significance of the relationship between features and the response in TL-HDR settings. In this paper, we introduce a novel statistical inference framework for assessing the reliability of feature selection in TL-HDR, called PTL-SI (Post-TL Statistical Inference). The core contribution of PTL-SI is its ability to provide valid $p$-values for features selected in TL-HDR, thereby rigorously controlling the false positive rate (FPR) at a desired significance level $\alpha$ (e.g., 0.05). Furthermore, we enhance statistical power by incorporating a strategic divide-and-conquer approach into our framework. We demonstrate the validity and effectiveness of the proposed PTL-SI through extensive experiments on both synthetic and real-world high-dimensional datasets, confirming its theoretical properties and utility in testing the reliability of feature selection in TL scenarios.
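
To illustrate why a $p$-value computed without accounting for selection can fail to control the FPR, the sketch below simulates a deliberately simplified setting. It is not TransFusion or PTL-SI: it assumes independent standard-normal $z$-scores for each feature under the global null, and a hypothetical selection rule that keeps the feature with the largest absolute $z$-score. The function names `naive_two_sided_p` and `naive_fpr` are illustrative only.

```python
import math
import random


def naive_two_sided_p(z):
    """Two-sided p-value for z ~ N(0, 1), ignoring how z was chosen."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def naive_fpr(num_features=10, num_trials=20000, alpha=0.05, seed=0):
    """Fraction of null trials in which the selected feature is declared significant."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(num_trials):
        # Under the global null, every feature's z-score is pure noise.
        z_scores = [rng.gauss(0.0, 1.0) for _ in range(num_features)]
        # Assumed selection rule: keep the feature that looks strongest.
        selected = max(z_scores, key=abs)
        if naive_two_sided_p(selected) <= alpha:
            rejections += 1
    return rejections / num_trials


fpr = naive_fpr()
print(f"naive FPR after selection: {fpr:.3f} (nominal level 0.05)")
```

In this toy setting the rejection probability is $1 - 0.95^{10} \approx 0.40$, far above the nominal 0.05, because the test is applied only to the feature that already looked extreme.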

Paper Structure

This paper contains 21 sections, 6 theorems, 106 equations, 13 figures, 3 algorithms.

Key Result

Lemma 1

The selective $p$-value proposed in eq:p_selective satisfies the validity property:
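
In the conditional selective inference literature, a validity property of this kind is usually written in the form below. This is a sketch of that common form, not a quotation of the lemma: the symbols $p_j^{\rm selective}$ (selective $p$-value for a selected feature $j$), $H_{0,j}$ (its null hypothesis), and $\mathcal{M}_{\boldsymbol{Y}}$ (the feature set selected from a random response $\boldsymbol{Y}$) are assumed notation and may differ from the paper's.

```latex
\mathbb{P}_{H_{0,j}}\left( p_j^{\rm selective} \le \alpha \;\middle|\; \mathcal{M}_{\boldsymbol{Y}} = \mathcal{M}_{\text{obs}} \right) = \alpha,
\quad \forall \alpha \in [0, 1]
```

Read this way, the $p$-value is uniformly distributed under the null once we condition on the selection event, which is what yields FPR control at level $\alpha$.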

Figures (13)

  • Figure 1: Illustration of the proposed PTL-SI method. Conducting post-transfer learning analysis without statistical inference in high-dimensional regression may lead to the erroneous identification of irrelevant features (e.g., $X_6$). The naive $p$-value can be small even for a falsely selected feature. In contrast, with the proposed PTL-SI method, we can identify both false positives (FPs) and true positives (TPs), i.e., large $p$-values for irrelevant features and small $p$-values for truly informative ones, thereby enhancing the reliability of feature selection after transfer learning.
  • Figure 2: Illustration of the Selective Inference method tailored for TransFusion. First, the TransFusion algorithm is applied to the combined source and target data, yielding the target estimate $\hat{\boldsymbol{\beta}}_{\rm TransFusion}^{(0)}$ and identifying the selected feature set $\mathcal{M}_{\text{obs}}$. The data are then parameterized using a scalar parameter $z$ in the dimension of the test statistic to define the truncation region $\mathcal{Z}$, the set of $z$ values for which the data yield the same feature selection result. To improve computational efficiency, a divide-and-conquer strategy is employed to identify $\mathcal{Z}$ efficiently. Finally, valid statistical inference is performed within the identified region $\mathcal{Z}$.
  • Figure 3: FPR and TPR w.r.t. the number of target instances $n_T$
  • Figure 4: FPR and TPR w.r.t. the true beta $\Gamma$
  • Figure 5: FPR and TPR w.r.t. the noise intensity $\Upsilon$
  • ...and 8 more figures
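
The last step described in the Figure 2 caption, inference within the truncation region $\mathcal{Z}$, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the test statistic is $N(0, \sigma^2)$ under the null, that $\mathcal{Z}$ has already been identified as a list of disjoint intervals, and that the two-sided $p$-value is taken as $2\min(F, 1 - F)$ with $F$ the truncated-normal CDF at the observed value. The names `interval_mass` and `selective_p_value` are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal CDF; handles +/- infinity."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def interval_mass(lo, hi, sigma):
    """N(0, sigma^2) probability of the interval [lo, hi]."""
    return normal_cdf(hi / sigma) - normal_cdf(lo / sigma)


def selective_p_value(z_obs, sigma, region):
    """Two-sided p-value of z_obs under N(0, sigma^2) truncated to `region`.

    `region` is a list of disjoint (lo, hi) intervals that contains z_obs.
    """
    total = sum(interval_mass(lo, hi, sigma) for lo, hi in region)
    # Probability mass of the region lying at or below the observed value.
    below = sum(
        interval_mass(lo, min(hi, z_obs), sigma)
        for lo, hi in region
        if lo < z_obs
    )
    cdf = below / total
    return 2.0 * min(cdf, 1.0 - cdf)


inf = float("inf")
z_obs = 2.0

# No selection: the region is the whole real line, so this is the naive p-value.
p_naive = selective_p_value(z_obs, 1.0, [(-inf, inf)])

# Assumed toy region: the feature is selected only when |z| >= 1.9.
p_selective = selective_p_value(z_obs, 1.0, [(-inf, -1.9), (1.9, inf)])

print(f"naive p = {p_naive:.4f}, selective p = {p_selective:.4f}")
```

With the toy region, an observation of 2.0 is barely past the selection threshold, so the selective $p$-value is much larger than the naive one. The difference of two CDF values used here loses precision far in the tails; a careful implementation would work with tail probabilities or higher-precision arithmetic.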

Theorems & Definitions (14)

  • Remark 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Remark 2
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • ...and 4 more