Statistical Inference for Sequential Feature Selection after Domain Adaptation

Duong Tan Loc; Nguyen Thang Loi; Vo Nguyen Le Duy

TL;DR

This work tackles reliable statistical inference for features selected by SeqFS after OT-based domain adaptation in high-dimensional regression. It introduces SI-SeqFS-DA, a selective inference framework that yields valid $p$-values while controlling the false positive rate at level $\alpha$ under domain shift, using a divide-and-conquer strategy to identify a truncation region ${\mathcal Z}$. The method extends naturally to backward SeqFS and model-selection criteria (AIC, BIC, adjusted $R^2$), with theoretical guarantees and extensive synthetic and real-data experiments showing improved FPR control and higher power. This approach enhances the reliability of SeqFS-DA in practice, especially when target data are scarce, and provides a scalable path to more robust feature selection under distributional shifts.
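For readers unfamiliar with the criteria named above, here is a minimal sketch of how AIC, BIC, and adjusted $R^2$ are commonly computed for a Gaussian linear model. The function name `selection_criteria` and its arguments are illustrative assumptions, and the textbook forms used here (unknown noise variance, additive constants dropped) may differ from the exact definitions in the paper.

```python
import math

def selection_criteria(rss, tss, n, k):
    """Return (AIC, BIC, adjusted R^2) for a Gaussian linear model.

    rss: residual sum of squares of the fitted model
    tss: total sum of squares of the response around its mean
    n:   number of observations
    k:   number of selected features (intercept not counted)

    Assumption: textbook forms with unknown noise variance; constants
    that do not depend on the model are dropped from AIC and BIC.
    """
    if n - k - 1 <= 0 or rss <= 0 or tss <= 0:
        raise ValueError("need n > k + 1 and positive rss, tss")
    fit_term = n * math.log(rss / n)      # -2 * max log-likelihood, up to a constant
    aic = fit_term + 2 * k                # penalty of 2 per feature
    bic = fit_term + k * math.log(n)      # penalty of log(n) per feature
    adj_r2 = 1.0 - (rss / (n - k - 1)) / (tss / (n - 1))
    return aic, bic, adj_r2
```

In a SeqFS setting, one would typically evaluate such a criterion after each step and keep the step with the smallest AIC or BIC (or the largest adjusted $R^2$).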

Abstract

In high-dimensional regression, feature selection methods, such as sequential feature selection (SeqFS), are commonly used to identify relevant features. When data is limited, domain adaptation (DA) becomes crucial for transferring knowledge from a related source domain to a target domain, improving generalization performance. Although SeqFS after DA is an important task in machine learning, none of the existing methods can guarantee the reliability of its results. In this paper, we propose a novel method for testing the features selected by SeqFS-DA. The main advantage of the proposed method is its capability to control the false positive rate (FPR) below a significance level $\alpha$ (e.g., 0.05). Additionally, a strategic approach is introduced to enhance the statistical power of the test. Furthermore, we provide extensions of the proposed method to SeqFS with model selection criteria including AIC, BIC, and adjusted R-squared. Extensive experiments are conducted on both synthetic and real-world datasets to validate the theoretical results and demonstrate the proposed method's superior performance.
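As a point of reference for the selection step, the following is a minimal sketch of greedy forward SeqFS for least-squares regression, written with the standard library only. It assumes no intercept and a fixed number of steps `k`, and it omits the domain-adaptation step entirely; the names `forward_seqfs` and `dot` are illustrative and not taken from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def forward_seqfs(columns, y, k):
    """Greedy forward selection: return up to k column indices in selection order.

    columns: list of feature columns, each a list of n floats
    y:       response, a list of n floats
    At each step, the feature giving the largest drop in the residual
    sum of squares (RSS) is added to the selected set.
    """
    residual = list(y)   # y minus its projection onto the selected features
    basis = []           # orthonormal basis of the span of the selected features
    selected = []
    for _ in range(k):
        best_j, best_gain, best_perp = None, 0.0, None
        for j, col in enumerate(columns):
            if j in selected:
                continue
            # Remove the part of the candidate already explained by the basis.
            perp = list(col)
            for q in basis:
                c = dot(q, perp)
                perp = [p - c * qi for p, qi in zip(perp, q)]
            norm2 = dot(perp, perp)
            if norm2 < 1e-12:   # candidate is numerically redundant
                continue
            gain = dot(residual, perp) ** 2 / norm2   # RSS reduction if j is added
            if gain > best_gain:
                best_j, best_gain, best_perp = j, gain, perp
        if best_j is None:      # nothing left that reduces the RSS
            break
        norm = math.sqrt(dot(best_perp, best_perp))
        q = [p / norm for p in best_perp]
        c = dot(q, residual)
        residual = [r - c * qi for r, qi in zip(residual, q)]
        basis.append(q)
        selected.append(best_j)
    return selected
```

The orthogonalization is used so that each candidate's RSS reduction can be read off from a single inner product, which avoids refitting a regression for every candidate at every step.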

Paper Structure (34 sections, 5 theorems, 78 equations, 22 figures, 2 algorithms)

Introduction
Problem Statement
Optimal Transport (OT)-based DA (Flamary et al., 2016)
Sequential Feature Selection (SeqFS) after OT-based DA
Statistical Inference on the Selected Features
Computation of a valid $p$-value
Proposed Method
The valid $p$-value in SI-SeqFS-DA
Characterization of the Conditioning Event
Identification of Truncation Region ${\mathcal{Z}}$
Divide-and-conquer strategy
Solving of each sub-problem
Computation of ${\mathcal{Z}}$ in (eq:cZ) by combining multiple sub-problems
Extensions to Backward SeqFS and Criteria for Optimal Model Selection
Backward SeqFS
...and 19 more sections

Key Result

Lemma

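The selective $p$-value proposed in (eq:selective_p) is a valid $p$-value, i.e., under the null hypothesis, the probability that it falls below any significance level $\alpha \in [0, 1]$ is controlled at $\alpha$.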
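
In symbols, validity of a $p$-value is commonly written as below. The notation ($p^{\mathrm{selective}}$ for the selective $p$-value, $\mathrm{H}_0$ for the null hypothesis) is assumed here for illustration, and the source may state the property conditionally on the selection event:

$$
\mathbb{P}_{\mathrm{H}_0}\left( p^{\mathrm{selective}} \le \alpha \right) = \alpha, \quad \forall \alpha \in [0, 1].
$$

Rejecting the null hypothesis whenever $p^{\mathrm{selective}} \le \alpha$ then gives a false positive rate of $\alpha$.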

Figures (22)

Figure 1: Illustration of the proposed SI-SeqFS-DA method. When SeqFS-DA is performed without statistical inference, it often results in the selection of irrelevant features (e.g., $X_4$), as the naive $p$-value for these features may appear small, even though they are falsely detected. In contrast, the SI-SeqFS-DA method improves feature selection by effectively distinguishing between false positives (FPs) and true positives (TPs). It assigns large $p$-values to FPs and small $p$-values to TPs, ensuring accurate identification of relevant features.
Figure 2: Illustration of the SI-SeqFS-DA method. First, we transform the data using domain adaptation (DA). Subsequently, sequential feature selection (SeqFS) is applied to identify the relevant features. The data is then parameterized using a scalar parameter $z$, defined in the dimension of the test statistic, to determine the truncation region ${\mathcal{Z}}$. To improve computational efficiency, a divide-and-conquer strategy is employed to effectively identify ${\mathcal{Z}}$. Finally, valid statistical inference is performed within the identified region ${\mathcal{Z}}$.
Figure 3: FPR and TPR in the case of Forward SeqFS
Figure 4: FPR and TPR in the case of Backward SeqFS
Figure 5: FPR and TPR in the case of Forward SeqFS with AIC
...and 17 more figures
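
The contrast described in Figure 1 between naive and selective $p$-values can be illustrated with a small sketch. It assumes, as is usual in selective inference, that the test statistic is normal with mean zero under the null and is restricted to a truncation region given as a union of disjoint intervals. The function names, the two-sided form, and the example region are assumptions for illustration, not the paper's implementation.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def selective_p_value(z_obs, sigma, intervals):
    """Two-sided p-value of z_obs under N(0, sigma^2) truncated to `intervals`.

    intervals: list of disjoint (lo, hi) pairs whose union plays the role of
               the truncation region; z_obs must lie inside one of them.
    """
    total = 0.0    # probability mass of the whole region
    below = 0.0    # probability mass of the region at or below z_obs
    inside = False
    for lo, hi in intervals:
        total += norm_cdf(hi / sigma) - norm_cdf(lo / sigma)
        if lo <= z_obs <= hi:
            inside = True
        if z_obs > lo:
            below += norm_cdf(min(z_obs, hi) / sigma) - norm_cdf(lo / sigma)
    if not inside or total <= 0.0:
        raise ValueError("z_obs must lie inside a region with positive mass")
    cdf = below / total    # truncated-normal CDF evaluated at z_obs
    return 2.0 * min(cdf, 1.0 - cdf)

# Naive p-value: no truncation, the whole real line is allowed.
p_naive = selective_p_value(1.96, 1.0, [(-math.inf, math.inf)])
# Selective p-value: hypothetical region where selection only happens for z >= 1.5.
p_selective = selective_p_value(1.96, 1.0, [(1.5, math.inf)])
```

In this example the naive $p$-value is close to 0.05, while the truncated one is much larger, because a statistic of 1.96 is unremarkable once attention is restricted to values above 1.5. The plain `erf`-based CDF loses accuracy far in the tails, so a practical implementation would need more careful numerics.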

Theorems & Definitions (10)

Lemma
Proof
Lemma
Proof
Lemma
Proof
Lemma
Proof
Remark
Lemma
