Statistical Inference for Feature Selection after Optimal Transport-based Domain Adaptation

Nguyen Thang Loi, Duong Tan Loc, Vo Nguyen Le Duy

TL;DR

By carefully examining the FS process under DA, whose operations can be characterized by linear and quadratic inequalities, the authors prove that FPR control in SFS-DA is achievable, and they improve the true detection rate by introducing a more strategic approach.

Abstract

Feature Selection (FS) under domain adaptation (DA) is a critical task in machine learning, especially when dealing with limited target data. However, existing methods cannot guarantee the reliability of FS under DA. In this paper, we introduce a novel statistical method, named SFS-DA (statistical FS-DA), for testing the reliability of FS under DA. The key strength of SFS-DA lies in its ability to control the false positive rate (FPR) below a pre-specified level $\alpha$ (e.g., 0.05) while maximizing the true positive rate. Compared to the literature on statistical FS, SFS-DA presents a unique challenge: the effect of DA must be accounted for to ensure the validity of the inference on FS results. We overcome this challenge by leveraging the Selective Inference (SI) framework. Specifically, by carefully examining the FS process under DA, whose operations can be characterized by linear and quadratic inequalities, we prove that achieving FPR control in SFS-DA is indeed possible. Furthermore, we enhance the true detection rate by introducing a more strategic approach. Experiments conducted on both synthetic and real-world datasets support our theoretical results and showcase the superior performance of the proposed SFS-DA method.
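
To illustrate why FPR control after selection needs special care, the following is a minimal toy sketch, not the SFS-DA procedure itself. It assumes independent standard normal test statistics with no true signal, selects the feature with the largest absolute statistic, and compares a naive two-sided $p$-value with a selective $p$-value computed from the normal distribution truncated at the runner-up statistic. All function names and the simulation setup are hypothetical and chosen only for illustration; no DA step is modeled.

```python
import math
import random


def normal_sf(x):
    """Upper tail probability P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def naive_p(z_selected):
    """Two-sided p-value that ignores the selection step."""
    return 2.0 * normal_sf(abs(z_selected))


def selective_p(z_selected, cutoff):
    """Two-sided p-value conditional on |Z| > cutoff (the selection event)."""
    return normal_sf(abs(z_selected)) / normal_sf(cutoff)


def simulate(num_trials=20000, num_features=10, alpha=0.05, seed=0):
    """Estimate the FPR of both p-values when every feature is null."""
    rng = random.Random(seed)
    naive_rejections = 0
    selective_rejections = 0
    for _ in range(num_trials):
        stats = sorted((abs(rng.gauss(0.0, 1.0)) for _ in range(num_features)),
                       reverse=True)
        top, runner_up = stats[0], stats[1]
        # The selected feature is the one with the largest |statistic|.
        naive_rejections += naive_p(top) <= alpha
        selective_rejections += selective_p(top, runner_up) <= alpha
    return naive_rejections / num_trials, selective_rejections / num_trials


if __name__ == "__main__":
    naive_fpr, selective_fpr = simulate()
    print(f"naive FPR: {naive_fpr:.3f}, selective FPR: {selective_fpr:.3f}")
```

In this toy setting the naive test rejects far more often than the nominal level, whereas the conditional test stays close to $\alpha$. The paper's method conditions on a much richer selection event that includes the DA step.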

Paper Structure

This paper contains 23 sections, 5 theorems, 66 equations, 10 figures, 1 table, 2 algorithms.

Key Result

Lemma 1

The selective $p$-value proposed in (eq:selective_p) satisfies the property of a valid $p$-value:

Figures (10)

  • Figure 1: Illustration of the proposed SFS-DA method. Performing FS-DA without statistical inference leads to the selection of an irrelevant feature ($X_4$). The naive $p$-value is small even for a falsely detected feature. With the proposed SFS-DA method, we successfully identify both false positives (FPs) and true positives (TPs), yielding large $p$-values for FPs and small $p$-values for TPs.
  • Figure 2: After performing DA, we apply FS to identify the relevant features. Next, we parametrize the data using a scalar parameter $z$ along the direction of the test statistic to define the truncation region ${\mathcal{Z}}$, the set of $z$ for which the data yield the same FS results as the observed data. Finally, we conduct the inference by conditioning on ${\mathcal{Z}}$. To improve efficiency, we use a divide-and-conquer strategy to identify the region ${\mathcal{Z}}$.
  • Figure 3: FPR and TPR in the case of Lasso
  • Figure 4: FPR and TPR in the case of elastic net
  • Figure 5: Computational cost of the proposed SFS-DA
  • ...and 5 more figures
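
The parametrization described in the Figure 2 caption can be sketched for the simplest case, where the selection event is a set of linear inequalities. The sketch below assumes the data are written as $y(z) = a + bz$ and the event as $A\,y(z) \le c$ row by row; the names `A`, `c`, `a`, `b` and the function `truncation_interval` are hypothetical and not taken from the paper. Quadratic inequalities and the divide-and-conquer search are not covered here.

```python
import math


def truncation_interval(A, c, a, b):
    """Return (lo, hi) such that A @ (a + b*z) <= c holds for lo <= z <= hi.

    Returns None if no z satisfies all inequalities.
    """
    lo, hi = -math.inf, math.inf
    for row, bound in zip(A, c):
        # Along the line y(z) = a + b*z each inequality becomes slope*z <= slack.
        slope = sum(r * bj for r, bj in zip(row, b))
        slack = bound - sum(r * aj for r, aj in zip(row, a))
        if slope > 0:
            hi = min(hi, slack / slope)
        elif slope < 0:
            lo = max(lo, slack / slope)
        elif slack < 0:
            return None  # constraint does not depend on z and is violated
    return (lo, hi) if lo <= hi else None


if __name__ == "__main__":
    # Toy event: y1 >= y2 and y1 >= 0, written as -y1 + y2 <= 0 and -y1 <= 0.
    A = [[-1.0, 1.0], [-1.0, 0.0]]
    c = [0.0, 0.0]
    # Line y(z) = (z, 1): only the first coordinate moves with z.
    print(truncation_interval(A, c, a=[0.0, 1.0], b=[1.0, 0.0]))
```

For the toy event shown, the interval is $z \ge 1$, which is exactly where the first coordinate stays at least as large as the second.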

Theorems & Definitions (8)

  • Example 1
  • Remark 1
  • Lemma 1
  • proof
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5