Statistical Test for Feature Selection Pipelines by Selective Inference

Tomohiro Shiraishi; Tatsuya Matsukawa; Shuichi Nishino; Ichiro Takeuchi

TL;DR

This work addresses the challenge of assessing statistical significance for feature-selection pipelines that combine multiple algorithmic components for missing-value imputation, outlier detection, and feature selection. It develops a selective-inference framework that yields valid $p$-values conditioned on the pipeline’s data-driven selection, and proves finite-sample control of false positives for any pipeline configuration within a predefined class. The authors implement an auto-conditioning, line-search-based method to compute selective $p$-values efficiently, and demonstrate through synthetic and real-data experiments that the proposed approach outperforms naive and over-conditioned baselines in both type I error control and statistical power. The framework enables rigorous, reproducible inference across complex data-analysis pipelines and provides a practical tool for evaluating the reliability of pipeline-derived conclusions in linear-model settings.

Abstract

A data analysis pipeline is a structured sequence of steps that transforms raw data into meaningful insights by integrating various analysis algorithms. In this paper, we propose a novel statistical test to assess the significance of data analysis pipelines in feature selection problems. Our approach enables the systematic development of valid statistical tests applicable to any feature selection pipeline composed of predefined components. We develop this framework based on selective inference, a statistical technique that has recently gained attention for data-driven hypotheses. As a proof of concept, we consider feature selection pipelines for linear models, composed of three missing value imputation algorithms, three outlier detection algorithms, and three feature selection algorithms. We theoretically prove that our statistical test can control the probability of false positive feature selection at any desired level, and demonstrate its validity and effectiveness through experiments on synthetic and real data. Additionally, we present an implementation framework that facilitates testing across any configuration of these feature selection pipelines without extra implementation costs.
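
To make the three-stage structure concrete, the following is a minimal sketch of one possible pipeline. The specific components (mean imputation of missing responses, a residual-based outlier rule, and marginal screening), the thresholds, and all function names are assumptions chosen for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def mean_impute(y):
    """Replace NaN responses with the mean of the observed responses."""
    y = y.copy()
    y[np.isnan(y)] = np.nanmean(y)
    return y

def detect_outliers(X, y, k=3.0):
    """Flag samples whose least-squares residual exceeds k robust standard deviations.

    Hypothetical rule: the scale is estimated from the median absolute deviation.
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    scale = 1.4826 * np.median(np.abs(r - np.median(r)))
    return np.flatnonzero(np.abs(r) > k * scale)

def marginal_screening(X, y, k):
    """Return the (sorted) indices of the k features with the largest |x_j^T y|."""
    scores = np.abs(X.T @ y)
    return np.sort(np.argsort(scores)[::-1][:k])

def run_pipeline(X, y, n_features=2):
    """Impute, drop detected outliers, then select features on the remaining samples."""
    y_imp = mean_impute(y)
    outliers = detect_outliers(X, y_imp)
    keep = np.setdiff1d(np.arange(len(y)), outliers)
    selected = marginal_screening(X[keep], y_imp[keep], n_features)
    return selected, outliers
```

Because both the set of removed samples and the set of selected features depend on the same response vector, a naive test on the selected features reuses the data twice; this is the selection effect that the conditional (selective) test is meant to account for.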

Paper Structure (63 sections, 3 theorems, 54 equations, 13 figures, 1 table, 1 algorithm)

Introduction
Related Work.
Contributions.
Preliminaries
Problem Setting.
Statistical Test for Pipelines.
Missing-Value Imputation (MVI) Algorithm Components.
Outlier Detection (OD) Algorithm Components.
Feature Selection (FS) Algorithm Components.
Union and Intersection Components.
Automatic Pipeline Construction.
Selective Inference (SI).
Selective Inference for Data Analysis Pipeline
Selective Inference.
Selective $p$-value.
...and 48 more sections

Key Result

Theorem 1

Consider a constant design matrix $X$, a random response vector $\bm{Y}\sim\mathcal{N}(\bm{\mu}, \sigma^2I_{n^\prime})$ and an observed response vector $\bm{y}$. Let $(\mathcal{M}_{\bm{Y}}, \mathcal{O}_{\bm{Y}})$ and $(\mathcal{M}_{\bm{y}}, \mathcal{O}_{\bm{y}})$ be the pairs of selected features and [...] is a truncated normal distribution $\mathrm{TN}(\bm{\eta}^\top\bm{\mu}, \sigma^2\|\bm{\eta}\|^2, \mathcal{Z})$ [...]
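
Since the theorem characterizes the conditional distribution as a truncated normal, a $p$-value can be computed from normal CDF values restricted to the truncation region. The sketch below assumes a null hypothesis under which the mean $\bm{\eta}^\top\bm{\mu}$ is zero, a truncation region given as a list of disjoint intervals, and a two-sided definition of the $p$-value; the function name and arguments are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def selective_p_value(t_obs, intervals, sigma_eta):
    """Two-sided p-value of t_obs under N(0, sigma_eta^2) truncated to a union of intervals.

    intervals: list of disjoint (lower, upper) pairs; endpoints may be +/- infinity.
    sigma_eta: standard deviation of the untruncated statistic (assumed known).
    """
    iv = np.asarray(intervals, dtype=float)
    lo, hi = iv[:, 0] / sigma_eta, iv[:, 1] / sigma_eta
    a = abs(t_obs) / sigma_eta

    # Total probability mass of the truncation region.
    total = np.sum(norm.cdf(hi) - norm.cdf(lo))
    # Mass of the region lying at or above +a.
    right = np.sum(norm.cdf(np.maximum(hi, a)) - norm.cdf(np.maximum(lo, a)))
    # Mass of the region lying at or below -a.
    left = np.sum(norm.cdf(np.minimum(hi, -a)) - norm.cdf(np.minimum(lo, -a)))
    return float((left + right) / total)

# With no truncation this reduces to the usual two-sided z-test.
p_naive = selective_p_value(1.96, [(-np.inf, np.inf)], 1.0)
# Conditioning on the statistic exceeding 1.0 makes the same observation less surprising.
p_selective = selective_p_value(1.96, [(1.0, np.inf)], 1.0)
```

This direct use of `norm.cdf` is adequate for illustration but can lose precision when the truncation region lies far in the tails; a careful implementation would work with log-probabilities or survival functions.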

Figures (13)

Figure 1: Two examples of pipelines within the class considered in this study.
Figure 2: Schematic illustration of the proposed line search method to identify the truncation intervals $\mathcal{Z}$. The upper part shows the DAG representation of the pipeline and its topological sorting (i). The lower left part shows the operations performed by update rules in sequence (ii). The lower right part shows the identification of the truncation intervals $\mathcal{Z}$ by taking the union of some intervals based on parametric-programming (iii).
Figure 3: Type I Error Rate and Power of op1 pipeline
Figure 7: op1
Figure 8: op2
...and 8 more figures
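
The caption of Figure 2 refers to representing a pipeline as a DAG and processing its components in topological order. A minimal sketch of that ordering step, using Python's standard `graphlib` module, is given below; the node names and the example graph are hypothetical and only illustrate the idea.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline DAG: each node maps to the set of nodes it depends on.
pipeline = {
    "mean_imputation": set(),
    "outlier_detection": {"mean_imputation"},
    "marginal_screening": {"outlier_detection"},
    "lasso": {"outlier_detection"},
    "union": {"marginal_screening", "lasso"},
}

# static_order() yields every node after all of its predecessors,
# so each component's inputs are available when its turn comes.
order = list(TopologicalSorter(pipeline).static_order())
```

Processing components in this order means each one can be handled in sequence along the line search, with the result of each component available to its successors.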

Theorems & Definitions (3)

Theorem 1
Theorem 2
Theorem 3
