Statistical Test for Feature Selection Pipelines by Selective Inference
Tomohiro Shiraishi, Tatsuya Matsukawa, Shuichi Nishino, Ichiro Takeuchi
TL;DR
This work addresses the challenge of assessing statistical significance for feature-selection pipelines that compose multiple algorithmic components for missing-value imputation, outlier detection, and feature selection. It develops a selective-inference framework that yields valid $p$-values conditioned on the pipeline’s data-driven selection, and proves finite-sample control of the false positive rate for any pipeline configuration within a predefined class. The authors implement an auto-conditioning, line-search-based method to compute selective $p$-values efficiently, and demonstrate through synthetic and real-data experiments that the proposed approach outperforms naive and over-conditioned baselines in both type I error control and statistical power. The framework enables rigorous, reproducible inference across complex data-analysis pipelines and provides a practical tool for evaluating the reliability of pipeline-derived conclusions in linear-model settings.
Abstract
A data analysis pipeline is a structured sequence of steps that transforms raw data into meaningful insights by integrating various analysis algorithms. In this paper, we propose a novel statistical test to assess the significance of data analysis pipelines in feature selection problems. Our approach enables the systematic development of valid statistical tests applicable to any feature selection pipeline composed of predefined components. We develop this framework based on selective inference, a statistical technique that has recently gained attention as a way to test data-driven hypotheses. As a proof of concept, we consider feature selection pipelines for linear models, composed of three missing value imputation algorithms, three outlier detection algorithms, and three feature selection algorithms. We theoretically prove that our statistical test can control the probability of false positive feature selection at any desired level, and demonstrate its validity and effectiveness through experiments on synthetic and real data. Additionally, we present an implementation framework that facilitates testing across any configuration of these feature selection pipelines without additional implementation cost.
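To make the idea of a $p$-value "conditioned on the selection" concrete, the following is a minimal sketch and not the authors' implementation. It assumes the form commonly used in the selective inference literature for Gaussian linear models: under the null hypothesis the test statistic is normal, and conditioning on the selection event restricts it to a union of intervals (the truncation region). The function names `norm_cdf` and `selective_p_value`, and the example interval, are hypothetical; in the paper's setting the truncation region would be obtained from the pipeline itself (for instance by the line search mentioned above), whereas here it is simply given as input.

```python
import math


def norm_cdf(x):
    """Standard normal CDF via the error function (stdlib only)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def selective_p_value(z_obs, intervals, sigma=1.0):
    """Two-sided p-value for Z ~ N(0, sigma^2) truncated to a union of intervals.

    `intervals` is a list of disjoint (lo, hi) pairs describing the values of
    the test statistic for which the same selection would have occurred.
    This is an illustrative assumption about the shape of the truncation
    region, not something taken from the paper.
    """
    t = abs(z_obs)
    total = 0.0  # probability mass of the whole truncation region
    tail = 0.0   # mass of {|z| >= t} inside the truncation region
    for lo, hi in intervals:
        total += norm_cdf(hi / sigma) - norm_cdf(lo / sigma)
        # Part of [lo, hi] lying in the right tail [t, +inf).
        right_lo = max(lo, t)
        if right_lo < hi:
            tail += norm_cdf(hi / sigma) - norm_cdf(right_lo / sigma)
        # Part of [lo, hi] lying in the left tail (-inf, -t].
        left_hi = min(hi, -t)
        if lo < left_hi:
            tail += norm_cdf(left_hi / sigma) - norm_cdf(lo / sigma)
    return tail / total


# No truncation: the result coincides with the ordinary (naive) two-sided p-value.
naive = selective_p_value(1.96, [(-math.inf, math.inf)])

# Hypothetical selection event: the feature is selected only when z > 1.5.
selective = selective_p_value(1.96, [(1.5, math.inf)])

print(f"naive p-value:     {naive:.3f}")
print(f"selective p-value: {selective:.3f}")
```

In this toy setting the selective $p$-value is considerably larger than the naive one, which illustrates why ignoring the selection step can inflate false positives. The sketch uses plain differences of CDF values, so it would lose accuracy far out in the tails; a careful implementation would need a numerically stabler evaluation.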
