Statistical Testing Framework for Clustering Pipelines by Selective Inference

Yugo Miyata; Tomohiro Shiraishi; Shunichi Nishino; Ichiro Takeuchi

Abstract

A data analysis pipeline is a structured sequence of steps that transforms raw data into meaningful insights by integrating multiple analysis algorithms. In many practical applications, analytical findings are obtained only after data pass through several data-dependent procedures within such pipelines. In this study, we address the problem of quantifying the statistical reliability of results produced by data analysis pipelines. As a proof of concept, we focus on clustering pipelines that identify cluster structures from complex and heterogeneous data through procedures such as outlier detection, feature selection, and clustering. We propose a novel statistical testing framework to assess the significance of clustering results obtained through these pipelines. Our framework, based on selective inference, enables the systematic construction of valid statistical tests for clustering pipelines composed of predefined components. We prove that the proposed test controls the type I error rate at any nominal level and demonstrate its validity and effectiveness through experiments on synthetic and real datasets.

Paper Structure (63 sections, 3 theorems, 66 equations, 11 figures, 16 tables, 3 algorithms)

Introduction
Related Work.
Contributions.
Preliminaries
Problem Setting.
Statistical Test for Pipelines.
Outlier Detection (OD) Algorithm Components.
Feature Selection (FS) Algorithm Components.
Clustering Algorithm Components.
Union and Intersection Components.
Selective Inference for Clustering Pipelines
Selective Inference.
Selective $p$-value.
Computations: Line Search Interpretation
Overview of the Line Search
...and 48 more sections

Key Result

Theorem 3.1

Consider a random data vector $\bm{X} \sim \mathcal{N}(\bm{\mu}, \bm{\Sigma})$ and an observed data vector $\bm{x}$. Let $(\mathcal{O}_{\bm{X}}, \mathcal{M}_{\bm{X}}, \mathcal{C}_{\bm{X}})$ and $(\mathcal{O}_{\bm{x}}, \mathcal{M}_{\bm{x}}, \mathcal{C}_{\bm{x}})$ be the pipeline outputs obtained by a follows a truncated normal distribution $\mathrm{TN}(\bm{\eta}^{\top}\bm{\mu},\, \bm{\eta}^{\top}\b
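
The theorem statement refers to a truncated normal distribution for the test statistic. As a minimal sketch of how a $p$-value can be read off such a distribution, the Python snippet below assumes that the truncation region is already available as a list of intervals on the test-statistic axis and that the mean and standard deviation of the untruncated normal under the null are known. The function names, the two-sided convention `2 * min(F, 1 - F)`, and the example intervals are illustrative assumptions, not the paper's implementation.

```python
import math


def norm_cdf(z):
    """Standard normal CDF via the error function (stdlib only)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_normal_p_value(t_obs, intervals, mean=0.0, sd=1.0):
    """Two-sided p-value of t_obs under N(mean, sd^2) truncated to `intervals`.

    `intervals` is a list of (lo, hi) pairs, assumed disjoint; their union
    plays the role of the truncation region. Hypothetical helper.
    """
    if not any(lo <= t_obs <= hi for lo, hi in intervals):
        raise ValueError("observed statistic must lie inside the truncation region")

    def mass(lo, hi):
        # Probability that the untruncated normal falls in [lo, hi].
        if hi <= lo:
            return 0.0
        return norm_cdf((hi - mean) / sd) - norm_cdf((lo - mean) / sd)

    total = sum(mass(lo, hi) for lo, hi in intervals)
    # Mass of the truncation region lying at or below the observed statistic.
    below = sum(mass(lo, min(hi, t_obs)) for lo, hi in intervals if lo < t_obs)
    cdf = below / total  # truncated-normal CDF evaluated at t_obs
    return 2.0 * min(cdf, 1.0 - cdf)


# Without truncation this reduces to the usual two-sided z-test.
p_naive = truncated_normal_p_value(2.0, [(-math.inf, math.inf)])
# With a hypothetical truncation region [1, 3], the same statistic is less extreme.
p_selective = truncated_normal_p_value(2.0, [(1.0, 3.0)])
```

Differences of CDF values lose precision far in the tails, so a careful implementation would use a more stable evaluation there; this sketch ignores that issue.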

Figures (11)

Figure 1: Two examples of clustering pipelines composed of outlier detection (OD), feature selection (FS), and clustering components, for which the proposed framework provides statistically valid $p$-values without additional implementation effort.
Figure 2: Schematic diagram of the proposed line search method to identify the truncated region $\mathcal{Z}$. The top part shows the DAG representation of the pipeline and its topological sorting (i). The lower left part shows the operations executed sequentially according to the update rules (ii). The lower right part shows how the truncated region $\mathcal{Z}$ is identified by taking the union of several intervals based on parametric-programming (iii).
Figure 3: Type I Error Rate of option1 pipeline
Figure 7: Algorithm components of the clustering pipeline.
Figure 8: Type I Error Rate of option1 pipeline
...and 6 more figures
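
The caption of Figure 2 mentions two generic building blocks: a topological sorting of the pipeline's DAG and a truncated region formed as a union of intervals. The sketch below illustrates only those two ideas in Python, under stated assumptions: the node names and the example DAG are hypothetical, and the step that actually produces the intervals (the parametric-programming line search) is not shown.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline DAG: each node maps to the set of nodes it depends on.
pipeline = {
    "od_a": {"input"},
    "od_b": {"input"},
    "union": {"od_a", "od_b"},
    "feature_selection": {"union"},
    "clustering": {"feature_selection"},
}


def execution_order(dag):
    """Return one valid order in which the components can be executed."""
    return list(TopologicalSorter(dag).static_order())


def merge_intervals(intervals):
    """Represent a union of (lo, hi) intervals as sorted, disjoint intervals."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


order = execution_order(pipeline)
region = merge_intervals([(2.0, 3.0), (-1.0, 0.5), (0.4, 1.0)])
```

Here `order` lists every component after all of its dependencies, and `region` is a compact representation of the union that could be passed to a truncated-normal computation.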

Theorems & Definitions (3)

Theorem 3.1
Theorem 3.2
Theorem 4.1
