Statistical Inference for Sequential Feature Selection after Domain Adaptation
Duong Tan Loc, Nguyen Thang Loi, Vo Nguyen Le Duy
TL;DR
This work addresses reliable statistical inference for features selected by sequential feature selection (SeqFS) after optimal-transport (OT)-based domain adaptation (DA) in high-dimensional regression. It introduces SI-SeqFS-DA, a selective inference framework that yields valid $p$-values and controls the false positive rate (FPR) at level $\alpha$ under domain shift, using a divide-and-conquer strategy to identify the truncation region $\mathcal{Z}$. The method extends to backward SeqFS and to model-selection criteria (AIC, BIC, adjusted $R^2$). Theoretical guarantees are supported by synthetic and real-data experiments showing improved FPR control and higher power. The approach makes SeqFS-DA more reliable in practice, especially when target data are scarce, and offers a scalable route to more robust feature selection under distributional shift.
Abstract
In high-dimensional regression, feature selection methods such as sequential feature selection (SeqFS) are commonly used to identify relevant features. When data are limited, domain adaptation (DA) becomes crucial for transferring knowledge from a related source domain to a target domain, thereby improving generalization performance. Although SeqFS after DA (SeqFS-DA) is an important task in machine learning, no existing method can guarantee the reliability of its results. In this paper, we propose a novel method for testing the features selected by SeqFS-DA. The main advantage of the proposed method is its ability to control the false positive rate (FPR) below a significance level $\alpha$ (e.g., 0.05). Additionally, a strategic approach is introduced to enhance the statistical power of the test. Furthermore, we extend the proposed method to SeqFS with model selection criteria, including AIC, BIC, and adjusted $R^2$. Extensive experiments on both synthetic and real-world datasets validate the theoretical results and demonstrate the proposed method's superior performance.
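
For readers unfamiliar with SeqFS, the greedy forward variant mentioned above can be sketched as follows. This is only a minimal illustration under stated assumptions: it uses ordinary least squares and the residual sum of squares as the selection score, and it includes neither the domain adaptation step nor the selective inference procedure proposed in the paper. The function name `forward_seqfs` and the synthetic data are hypothetical and chosen for illustration.

```python
import numpy as np


def forward_seqfs(X, y, k):
    """Greedy forward selection (illustrative sketch, not the paper's code).

    At each step, add the feature whose inclusion gives the smallest
    residual sum of squares (RSS) of a least-squares fit.
    """
    n, p = X.shape
    selected = []
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            # Least-squares fit on the candidate feature subset.
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            resid = y - X[:, cols] @ beta
            rss = float(resid @ resid)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected


# Hypothetical synthetic data: only features 1 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = 2.0 * X[:, 1] - 3.0 * X[:, 3] + 0.1 * rng.normal(size=50)
selected = forward_seqfs(X, y, k=2)
print(selected)
```

Because the selected set depends on the same data later used for testing, naive $p$-values for these features are generally invalid, which is the problem the selective inference framework is designed to address.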
