The Terminating-Random Experiments Selector: Fast High-Dimensional Variable Selection with False Discovery Rate Control

Jasin Machkour, Michael Muma, Daniel P. Palomar

TL;DR

The T-Rex selector outperforms state-of-the-art methods for FDR control in numerical experiments and on a simulated genome-wide association study (GWAS), while its sequential computation time is more than two orders of magnitude lower than that of the strongest benchmark methods.

Abstract

We propose the Terminating-Random Experiments (T-Rex) selector, a fast variable selection method for high-dimensional data. The T-Rex selector controls a user-defined target false discovery rate (FDR) while maximizing the number of selected variables. This is achieved by fusing the solutions of multiple early-terminated random experiments. The experiments are conducted on a combination of the original predictors and multiple sets of randomly generated dummy predictors. A finite-sample proof of the FDR control property, based on martingale theory, is provided. Numerical simulations confirm that the FDR is controlled at the target level while allowing for high power. We prove that the dummies can be sampled from any univariate probability distribution with finite expectation and variance. The computational complexity of the proposed method is linear in the number of variables. The T-Rex selector outperforms state-of-the-art methods for FDR control in numerical experiments and on a simulated genome-wide association study (GWAS), while its sequential computation time is more than two orders of magnitude lower than that of the strongest benchmark methods. The open-source R package TRexSelector, which contains the implementation of the T-Rex selector, is available on CRAN.
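The three ingredients named in the abstract (dummy predictors, early termination, and fusing the candidate sets) can be illustrated with a toy sketch. This is not the authors' implementation and does not use the TRexSelector package: the forward selection step is replaced by a simple ranking by absolute marginal correlation (a stand-in for T-LARS), and the parameters `T`, `L`, `K`, and `v` are fixed by hand rather than calibrated for a target FDR. All function names and the synthetic data are assumptions made for illustration.

```python
import math
import random


def abs_corr(x, y):
    """Absolute sample correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy)


def random_experiment(columns, y, L, T, rng):
    """One early-terminated random experiment; returns its candidate set.

    Assumption: variables enter in order of decreasing |correlation| with y,
    a simplification that stands in for a real forward selection path.
    """
    n, p = len(y), len(columns)
    # Ingredient 1: L dummies drawn from the univariate standard normal.
    dummies = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(L)]
    scores = [abs_corr(c, y) for c in columns + dummies]
    order = sorted(range(p + L), key=lambda j: scores[j], reverse=True)
    # Ingredient 2: stop as soon as T dummies have been included.
    candidates, n_dummies = set(), 0
    for j in order:
        if j >= p:  # indices p, ..., p + L - 1 are dummies
            n_dummies += 1
            if n_dummies == T:
                break
        else:
            candidates.add(j)
    return candidates


def fuse(candidate_sets, p, v):
    """Ingredient 3: keep variables whose relative occurrence exceeds v."""
    K = len(candidate_sets)
    freq = [sum(j in s for s in candidate_sets) / K for j in range(p)]
    return {j for j in range(p) if freq[j] > v}


# Synthetic data (illustrative): 3 active variables among p = 30.
rng = random.Random(1)
n, p, active = 200, 30, [0, 1, 2]
columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
y = [sum(3.0 * columns[j][i] for j in active) + rng.gauss(0.0, 1.0)
     for i in range(n)]

sets = [random_experiment(columns, y, L=p, T=1, rng=rng) for _ in range(20)]
selected = fuse(sets, p, v=0.75)
print(sorted(selected))
```

Because each experiment draws fresh dummies, a null variable only survives the vote if it beats the dummies in most of the K experiments, whereas strongly active variables appear in nearly every candidate set.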

Paper Structure

This paper contains 25 sections, 7 theorems, 56 equations, 15 figures, 1 table.

Key Result

Corollary 1

Let $\mathcal{Z}_{m, k}$ and $\mathcal{D}_{m, k}$ be the index sets of the non-included null and dummy variables in the $m$th LARS [...]

Footnote: Corollary 1 and subsequent results apply to all forward selection methods that select one (and do not drop any) variable in each forward selection st[...]

Figures (15)

  • Figure 7: Ingredient 1 - sampling dummies from the univariate standard normal distribution. The sequential computation time of generating one dummy matrix for the proposed T-Rex selector is multiple orders of magnitude lower than the computation time of generating a knockoff matrix for the model-X knockoff method, which is a current benchmark. For example, for $p = 5{,}000$ and $L = p$, the T-Rex dummy generation process requires less than a second, compared to more than five hours for the model-X knockoff method. Even taking into account that the T-Rex selector requires, e.g., $K = 20$ such dummy matrices, its sequential computation time is still multiple orders of magnitude lower than that of the model-X knockoff method. The jump in computation time for the model-X knockoff method between $p = 500$ and $p = 1{,}000$ is due to the suggestion of the authors to solve their proposed approximate semi-definite program (asdp) instead of their original semi-definite program for $p > 500$ in order to reduce the computation time required to generate model-X knockoffs. Note that both axes are scaled logarithmically. Setup: $n = 300$, $MC = 955$.
  • Figure 8: Ingredient 2 - early terminating the solution paths of the random experiments. Figure (a) exemplifies that, on average, the number of selected active variables quickly increases towards the sparsity level $p_{1}$ (i.e., the number of active variables): with as few as three included dummies, almost all active variables are selected on average. However, the number of selected null variables also increases with increasing $T$. Figure (b) illustrates that for $p = 5{,}000$ and $L = p$, when terminated early, the Terminating-LARS (T-LARS) algorithm (a fundamental building block of the T-Rex selector) is substantially faster than fitting the entire Lasso solution path using the pathwise coordinate descent algorithm for $2p$ variables, as is done by the fixed-X and model-X knockoff methods. Although the T-Rex selector needs to run the T-LARS algorithm for, e.g., $K = 20$ random experiments, its sequential computation time is still comparable to that of a single run of "glmnet" in high-dimensional settings where $p$ is much larger than $n$. Moreover, the independent random experiments can be run in parallel on multicore computers to achieve a substantial reduction in computation time. The "glmnet" computation time is used as the reference computation time and its absolute value is given above the reference line (format: hh:mm:ss). Note that after $T = 150$ dummies are included, the computation time of the T-LARS algorithm does not increase further because the T-LARS algorithm includes at most $\min\lbrace n, p + L \rbrace = n = 300$ variables, and with $T = 150$ we can expect that, on average, $150$ null variables plus the $5$ active variables are also included.
  • Figure 9: Ingredient 3 - fusing the candidate sets based on their relative occurrences and a voting level $v \in [0.5, 1)$. The number of selected active variables remains high when increasing the voting level, while the number of selected null variables decreases faster with increasing $v$. Setup: $n = 150$, $p = 300$, $p_{1} = 5$, $T = 3$, $L = p$, $K = 20$, $\text{SNR} = 1$, $MC = 500$.
  • Figure 10: Exemplary numerical verification of Corollary 2 and Assumption 1: The histogram of the number of included null variables in Figure (a) approximates the theoretical probability mass function (PMF). The expected value of a random variable following the negative hypergeometric distribution with the parameters specified in the last sentence of this caption is given by $T \cdot p_{0} \, / \, (L + 1) = 20 \cdot 290 \, / \, (300 + 1) \approx 19.27$, which fits the mean of the histogram. The Q-Q plot in Figure (b) confirms that the number of included null variables follows the negative hypergeometric distribution. Setup: $n = 150$, $p = 300$, $p_{1} = 5$, $T = 20$, $L = p$, $K = 20$, $\text{SNR} = 1$, $MC = 500$.
  • Figure 11: Exemplary numerical verification of Assumption 2: For $v \geq 0.5$, a random variable following the negative hypergeometric distribution stochastically dominates the random variable $V_{T, L}(v)$ (i.e., the number of selected null variables) at almost all values of $V_{T, L}(v)$. Setup: $n = 150$, $p = 300$, $p_{1} = 5$, $T = 20$, $L = p$, $K = 20$, $\text{SNR} = 1$, $MC = 500$.
  • ...and 10 more figures
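
The expected value quoted in the Figure 10 caption, $T \cdot p_{0} / (L + 1)$, can be checked with a small simulation. The sketch below is illustrative only: it assumes that null and dummy variables enter the selection path in a uniformly random order, so that the number of nulls included before the $T$th dummy follows a negative hypergeometric distribution. It uses the values that appear in the caption's formula ($T = 20$, $p_{0} = 290$, $L = 300$); the function names and the number of draws are hypothetical choices.

```python
import random


def nulls_before_tth_dummy(p0, L, T, rng):
    """Count nulls that precede the T-th dummy in a random ordering.

    Assumption: the p0 nulls and L dummies are exchangeable, so every
    ordering of the p0 + L variables is equally likely.
    """
    labels = [0] * p0 + [1] * L  # 0 = null, 1 = dummy
    rng.shuffle(labels)
    nulls = dummies = 0
    for lab in labels:
        if lab == 1:
            dummies += 1
            if dummies == T:
                break
        else:
            nulls += 1
    return nulls


def expected_nulls(p0, L, T):
    """Mean of the negative hypergeometric distribution: T * p0 / (L + 1)."""
    return T * p0 / (L + 1)


rng = random.Random(0)
T, p0, L = 20, 290, 300
draws = [nulls_before_tth_dummy(p0, L, T, rng) for _ in range(5000)]
sim_mean = sum(draws) / len(draws)

print(f"closed form: {expected_nulls(p0, L, T):.2f}")  # 5800 / 301, about 19.27
print(f"simulated:   {sim_mean:.2f}")
```

The simulated mean should land close to the closed-form value, mirroring the agreement between the histogram mean and the theoretical expectation described in the caption.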

Theorems & Definitions (21)

  • Corollary 1
  • proof
  • Corollary 2
  • proof
  • Corollary 3
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • ...and 11 more