Data-Adaptive Automatic Threshold Calibration for Stability Selection

Martin Huang, Samuel Muller, Garth Tarr

TL;DR

This work tackles the sensitivity of stability selection to the stable-threshold parameter $\pi$ by introducing Exclusion Automatic Threshold Selection (EATS), a data-adaptive procedure that first filters out potential noise variables using an exclusion threshold derived from shuffled data and then chooses $\hat{\pi}$ by the elbow method to form the stable set. The method combines Automatic Threshold Selection (ATS) with an Exclusion Probability Threshold (EPT) to calibrate $\pi$ without manual tuning, while preserving error control under standard exchangeability assumptions. Across extensive artificial and real-data experiments, EATS achieves higher Matthews correlation coefficients and less overselection, particularly in high-dimensional settings where $p>n$, and is robust to the choice of stability-selection procedure. The approach doubles computation because stability selection is also run on the shuffled data, but it yields a practical, tuning-free default for stability selection, with clear applicability to genomic and proteomic high-dimensional problems and potential extensions to complementary-pairs stability selection.

Abstract

Stability selection has gained popularity as a method for enhancing the performance of variable selection algorithms while controlling false discovery rates. However, achieving these desirable properties depends on correctly specifying the stable threshold parameter, which can be challenging. An arbitrary choice of this parameter can substantially alter the set of selected variables, as the variables' selection probabilities are inherently data-dependent. To address this issue, we propose Exclusion Automatic Threshold Selection (EATS), a data-adaptive algorithm that streamlines stability selection by automating the threshold specification process. EATS initially filters out potential noise variables using an exclusion probability threshold, derived from applying stability selection to a randomly shuffled version of the dataset. Following this, EATS selects the stable threshold parameter using the elbow method, balancing the marginal utility of including additional variables against the risk of selecting superfluous variables. We evaluate our approach through an extensive simulation study, benchmarking across commonly used variable selection algorithms and static stable threshold values.
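The two stages described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation. Several details are assumptions made only for illustration: the base selector (`top_k_by_correlation`, a hypothetical marginal-correlation screen), shuffling only the response, taking the exclusion threshold as the largest selection probability observed on the shuffled data, and locating the elbow by a two-group split that minimises within-group squared error. All function names are hypothetical.

```python
import math
import random


def top_k_by_correlation(X, y, k):
    """Hypothetical base selector: the k columns ranked highest by |correlation| with y."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    scores = []
    for j in range(p):
        col = [row[j] for row in X]
        c_mean = sum(col) / n
        cov = sum((c - c_mean) * (v - y_mean) for c, v in zip(col, y))
        var = sum((c - c_mean) ** 2 for c in col)
        # Dividing by the spread of y would not change the ranking, so it is skipped.
        scores.append(abs(cov) / math.sqrt(var) if var > 0 else 0.0)
    return sorted(range(p), key=lambda j: -scores[j])[:k]


def selection_probabilities(X, y, k, n_subsamples, rng):
    """Fraction of half-size subsamples in which each variable is selected."""
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)
        selected = top_k_by_correlation([X[i] for i in idx], [y[i] for i in idx], k)
        for j in selected:
            counts[j] += 1
    return [c / n_subsamples for c in counts]


def elbow_index(sorted_probs):
    """Number of leading values kept by a two-group split of a descending sequence.

    Assumed criterion: the split with the smallest within-group squared error.
    """
    m = len(sorted_probs)
    if m < 2:
        return m
    best_q, best_sse = 1, float("inf")
    for q in range(1, m):
        sse = 0.0
        for group in (sorted_probs[:q], sorted_probs[q:]):
            mean = sum(group) / len(group)
            sse += sum((v - mean) ** 2 for v in group)
        if sse < best_sse:
            best_q, best_sse = q, sse
    return best_q


def eats_sketch(X, y, k=3, n_subsamples=50, seed=0):
    """Exclusion filter from shuffled data, then an elbow on the surviving variables."""
    rng = random.Random(seed)
    p = len(X[0])
    probs = selection_probabilities(X, y, k, n_subsamples, rng)

    # Second stability-selection run on a shuffled response (assumed shuffling scheme).
    y_shuffled = list(y)
    rng.shuffle(y_shuffled)
    null_probs = selection_probabilities(X, y_shuffled, k, n_subsamples, rng)
    exclusion = max(null_probs)  # assumed form of the exclusion threshold

    # Keep variables above the exclusion threshold, most frequently selected first.
    kept = sorted((j for j in range(p) if probs[j] > exclusion), key=lambda j: -probs[j])
    stable = kept[: elbow_index([probs[j] for j in kept])]
    pi_hat = probs[stable[-1]] if stable else None  # smallest probability in the stable set
    return stable, pi_hat, exclusion, probs
```

The second call to `selection_probabilities` is where the doubled computation mentioned in the TL;DR comes from. Note that the two-group split used here always places at least one surviving variable to the right of the elbow whenever two or more survive; that is a property of this simplified criterion and should not be read as a property of EATS itself.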

Paper Structure

This paper contains 15 sections, 6 equations, 15 figures, 7 tables, 1 algorithm.

Figures (15)

  • Figure 1: An example of how changing the stable threshold parameter, $\pi$, affects signal variable recovery, as measured by the Matthews correlation coefficient (MCC). Formal definitions of the different methods are introduced in the methodology section. In this setting, the range $\pi \in [0.6, 0.9]$ recommended by Meinshausen and Bühlmann (2010) does not overlap with the range of $\pi$ that achieves the maximum MCC. Our proposed method, EATS, returns a calibrated value $\hat{\pi}$ in the optimal MCC range without the need to manually specify the parameter.
  • Figure 2: MCC score for simulation study settings (I) - (IV) with varying SNR across the different methods.
  • Figure 3: Number of selected variables for simulation study settings (I) - (IV) with varying SNR. The dashed line denotes the number of active variables $|\bm{\beta}_S|$.
  • Figure 4: The top row shows scree plots of the selection probabilities of the $30$ most frequently selected variables. Circles mark EATS-selected variables, which fall to the left of the elbow (dotted line); crossed circles mark non-selected variables. Red (black) points indicate true signal (noise) variables. In the likelihood plots (bottom row), the dotted line marks the index of the variable that maximises the likelihood function. The panels use artificial dataset settings (I) - (IV) with SNR $=3$.
  • Figure 5: Change in MCC when varying $p$ and $n$ for a fixed $\text{SNR} = 3$.
  • ...and 10 more figures
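
Several of the figures above report the Matthews correlation coefficient for a selected set of variables. As a reading aid, here is a minimal sketch of how MCC can be computed for a variable-selection result, treating true signal variables as positives. The function name `mcc_for_selection` is hypothetical, and returning 0 when the denominator vanishes is a common convention assumed here, not something taken from the paper.

```python
import math


def mcc_for_selection(selected, truth, p):
    """MCC of a selected variable set against the true signal set, out of p variables."""
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)   # signal variables that were selected
    fp = len(selected - truth)   # noise variables that were selected
    fn = len(truth - selected)   # signal variables that were missed
    tn = p - tp - fp - fn        # noise variables correctly left out
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when the denominator is zero.
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0
```

A perfect selection gives an MCC of 1. Selecting noise variables alongside the signal, or missing signal variables, lowers the score.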