Replica Analysis for Ensemble Techniques in Variable Selection

Takashi Takahashi

TL;DR

This paper addresses the problem of evaluating and comparing ensemble-based variable selection methods in high-dimensional regimes where traditional $p$-value based approaches falter. It uses the replica method to derive a mean-field, single-body description of stability selection (SS) and derandomized knockoff (dKO) under proportional asymptotics ($N,M\to\infty$, $M/N\to\alpha$), yielding self-consistent equations for a small set of order parameters that govern selection performance. The main finding is that, across a range of $\alpha$ and noise levels, dKO with $\ell_1$ regularization generally outperforms vanilla knockoff and standard SS, while increasing the SS bootstrap rate $\mu_B$ can further enhance power; numerically, the theoretical predictions align well with simulations. The work provides a physics-inspired, quantitative framework for choosing ensemble variable-selection methods in high-dimensional statistics and suggests directions for extending the theory to more realistic, correlated data designs.

Abstract

Variable selection is a problem in statistics that aims to find the subset of the $N$ candidate explanatory variables that are truly related to the generation process of the response variable. In high-dimensional setups, where the input dimension $N$ is comparable to the data size $M$, it is difficult to use classic methods based on $p$-values. Therefore, methods based on ensemble learning are often used. In this review article, we introduce how the performance of these ensemble-based methods can be systematically analyzed using the replica method from statistical mechanics when $N$ and $M$ diverge at the same rate, that is, $N,M\to\infty$ with $M/N\to\alpha\in(0,\infty)$. As a concrete application, we analyze the power of stability selection (SS) and the derandomized knockoff (dKO) with $\ell_1$-regularized statistics in the high-dimensional linear model. The result indicates that dKO provably outperforms the vanilla knockoff and the standard SS, while increasing the bootstrap resampling rate in SS might further improve the detection power.
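
To make the stability selection procedure mentioned above concrete, the following is a minimal toy sketch and not the paper's implementation. It assumes a plain Lasso fitted by coordinate descent as the $\ell_1$-regularized base estimator, bootstrap resampling of `round(mu_B * M)` rows with replacement, and a fixed selection-frequency threshold. The names `lasso_cd`, `stability_selection`, `make_data`, `lam`, `n_boot`, and `pi_th` are hypothetical; `mu_B` is meant to mirror the bootstrap rate $\mu_B$, and details such as randomized penalty weights are omitted.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=30):
    """Coordinate descent for (1/2M)||y - Xw||^2 + lam * ||w||_1."""
    M, N = len(X), len(X[0])
    w = [0.0] * N
    resid = list(y)  # residual y - Xw, starting from w = 0
    col_sq = [sum(X[m][j] ** 2 for m in range(M)) / M for j in range(N)]
    for _ in range(n_sweeps):
        for j in range(N):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes w_j.
            rho = sum(X[m][j] * resid[m] for m in range(M)) / M + col_sq[j] * w[j]
            new_wj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for m in range(M):
                    resid[m] -= delta * X[m][j]
                w[j] = new_wj
    return w


def stability_selection(X, y, lam, mu_B=1.0, n_boot=40, pi_th=0.6, seed=0):
    """Return per-variable selection frequencies and the thresholded set."""
    rng = random.Random(seed)
    M, N = len(X), len(X[0])
    size = max(1, round(mu_B * M))  # assumed meaning of the bootstrap rate
    counts = [0] * N
    for _ in range(n_boot):
        idx = [rng.randrange(M) for _ in range(size)]  # resample with replacement
        w = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(N):
            if w[j] != 0.0:
                counts[j] += 1
    freq = [c / n_boot for c in counts]
    selected = [j for j in range(N) if freq[j] >= pi_th]
    return freq, selected


def make_data(M=60, N=20, k=3, noise_std=0.1, seed=1):
    """Toy linear model: the first k variables are the true support."""
    rng = random.Random(seed)
    w_true = [2.0] * k + [0.0] * (N - k)
    X = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(M)]
    y = [sum(X[m][j] * w_true[j] for j in range(N)) + rng.gauss(0.0, noise_std)
         for m in range(M)]
    return X, y, w_true


if __name__ == "__main__":
    X, y, w_true = make_data()
    freq, selected = stability_selection(X, y, lam=0.3)
    print("selection frequencies:", freq)
    print("selected variables:", selected)
```

Aggregating over resamples replaces a single, possibly unstable Lasso support with a frequency per variable, which is then thresholded. In this sketch, changing `mu_B` only changes the size of each resample, so it can be used to explore the effect of the bootstrap rate on a toy problem.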

Paper Structure

This paper contains 15 sections, 40 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Graphical representation of the replicated system defined in \ref{eq: replicated system}.
  • Figure 2: Comparison of theoretical predictions and experimental values for macroscopic quantities. (a)-(b) Results for SS. (c)-(d) Results for dKO. The markers with error bars represent the experimental values, where the error bars show the standard error over several realizations of $D$. The solid lines represent the theoretical predictions. In all cases, parameters are set as $(\alpha,\rho,\Delta,\gamma_{\rm th},\Pi_{\rm th,dKO},\Pi_{\rm th,SS})=(2.5,0.3,0.01,0.05,0.15,0.15)$.
  • Figure 3: Comparison of the perfect reconstruction limits \ref{eq: perfect reconstruction limit} of each algorithm.
  • Figure 4: Comparisons of detection powers of several variable selection algorithms at several values of $(\alpha, \Delta)$. In all cases, $\rho$ is set as $0.5$. (a)-(d): Small noise case with $\Delta=0.01$. (e)-(h): Large noise case with $\Delta=0.1$. (a) and (e): $\alpha=0.36$. (b) and (f): $\alpha=0.63$. (c) and (g): $\alpha=1.12$. (d) and (h): $\alpha=2$.