Estimating the False Discovery Rate of Variable Selection

Yixiang Luo; William Fithian; Lihua Lei

TL;DR

This work presents a general-purpose estimator for the false discovery rate of any variable selection procedure, valid across Gaussian linear models, Gaussian graphical models, and model-X settings. The estimator decomposes FDR into per-variable terms, uses Rao-Blackwellization to obtain unbiased components under the null, and applies a normalization via p-values to produce a conservative overall estimate. It is complemented by a bootstrap-based standard error procedure and supported by theoretical variance bounds and asymptotic bootstrap validity in a Gaussian-linear regime with block-orthogonal design. Real-data applications (HIV drug resistance and protein signaling) alongside comprehensive simulations demonstrate that the estimator provides actionable insight into the trade-off between predictive accuracy and variable-selection accuracy, often surpassing what cross-validation alone conveys. The accompanying software (R package hFDR) enables practitioners to apply the method to diverse problems and model choices with interpretable FDR guidance along the model complexity path.
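The per-variable decomposition mentioned above follows directly from the definition of FDR. The display below is a sketch: the first line is a definitional identity, while the second line is only a schematic reading of the description (Rao-Blackwellized per-variable terms weighted by a null-indicator estimate $\psi_j$), not a transcription of the paper's definition. The symbol $\widehat{\textnormal{FDR}}_j^{*}$ and the $|\mathcal{R}| \vee 1$ convention are notational assumptions introduced here.

```latex
% Line 1: definitional identity, with |R| \vee 1 encoding the convention 0/0 = 0.
\textnormal{FDR}
  = \mathbb{E}\!\left[\frac{|\mathcal{R} \cap \mathcal{H}_0|}{|\mathcal{R}| \vee 1}\right]
  = \sum_{j=1}^{d} \mathbf{1}\{j \in \mathcal{H}_0\}\,
    \mathbb{E}\!\left[\frac{\mathbf{1}\{j \in \mathcal{R}\}}{|\mathcal{R}| \vee 1}\right]

% Line 2 (assumed schematic): replace each unknown piece by a data-driven one.
% \widehat{FDR}_j^* conditions on a statistic S_j that is sufficient under H_j,
% and \psi_j(D) in [0,1] estimates the indicator 1{j in H_0}.
\widehat{\textnormal{FDR}}
  = \sum_{j=1}^{d} \psi_j(\boldsymbol{D})\,\widehat{\textnormal{FDR}}_j^{*},
\qquad
\widehat{\textnormal{FDR}}_j^{*}
  = \mathbb{E}_{H_j}\!\left[\frac{\mathbf{1}\{j \in \mathcal{R}\}}{|\mathcal{R}| \vee 1}
    \,\middle|\, \boldsymbol{S}_j\right]
```

Because $\boldsymbol{S}_j$ is sufficient under $H_j$, the conditional expectation in the second line does not depend on unknown parameters when $H_j$ holds, which is what makes each per-variable term computable from the data.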

Abstract

We introduce a generic estimator for the false discovery rate of any model selection procedure, in common statistical modeling settings including the Gaussian linear model, Gaussian graphical model, and model-X setting. We prove that our method has a conservative (non-negative) bias in finite samples under standard statistical assumptions, and provide a bootstrap method for assessing its standard error. For methods like the Lasso, forward-stepwise regression, and the graphical Lasso, our estimator serves as a valuable companion to cross-validation, illuminating the tradeoff between prediction error and variable selection accuracy as a function of the model complexity parameter.
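As a rough illustration of using an FDR estimate as a companion to cross-validation, the sketch below picks the minimum-CV model and the one-standard-error-rule model along a complexity path and reads off the estimated FDR at each. The `PathPoint` class, the function names, and every number in `path` are hypothetical and made up for illustration; they are not results from the paper, and the code does not compute the FDR estimate itself.

```python
from dataclasses import dataclass


@dataclass
class PathPoint:
    lam: float      # complexity parameter; larger means a sparser model here
    cv_mse: float   # cross-validated mean squared error at this lam
    cv_se: float    # standard error of cv_mse across folds
    fdr_hat: float  # estimated FDR at this lam (assumed to be supplied)


def min_cv(path):
    """Return the path point with the smallest cross-validated error."""
    return min(path, key=lambda p: p.cv_mse)


def one_se_rule(path):
    """Return the sparsest point whose CV error is within one standard
    error of the minimum CV error."""
    best = min_cv(path)
    threshold = best.cv_mse + best.cv_se
    eligible = [p for p in path if p.cv_mse <= threshold]
    return max(eligible, key=lambda p: p.lam)


# Made-up numbers, for illustration only.
path = [
    PathPoint(lam=0.01, cv_mse=1.10, cv_se=0.05, fdr_hat=0.62),
    PathPoint(lam=0.05, cv_mse=1.00, cv_se=0.05, fdr_hat=0.45),
    PathPoint(lam=0.10, cv_mse=1.03, cv_se=0.05, fdr_hat=0.20),
    PathPoint(lam=0.20, cv_mse=1.04, cv_se=0.05, fdr_hat=0.08),
    PathPoint(lam=0.40, cv_mse=1.30, cv_se=0.06, fdr_hat=0.02),
]

for name, point in [("min-CV", min_cv(path)), ("1-SE rule", one_se_rule(path))]:
    print(f"{name}: lam={point.lam}, cv_mse={point.cv_mse}, fdr_hat={point.fdr_hat}")
```

In this toy path the two rules have nearly the same CV error but very different estimated FDR, which is the kind of trade-off the CV curve alone would not show.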

Paper Structure (42 sections, 27 theorems, 312 equations, 15 figures, 2 algorithms)

Introduction
FDR estimation in the Gaussian linear model
General formulation of our estimator
Notation and statistical assumptions
FDR estimation
Understanding our method
Example model assumptions
Standard error of $\widehat{\textnormal{FDR}}$
Theoretical bound for standard error
Standard error estimation by bootstrap
Theory of parametric bootstrap variance estimation
Real world examples
HIV drug resistance studies
Protein-signaling network
Simulation studies
...and 27 more sections

Key Result

Theorem 2.1

Suppose for each $j = 1, \ldots, d$, $\boldsymbol{S}_j$ is a sufficient statistic under the null submodel $H_j$. Then for any selection procedure $\mathcal{R}$, and for any estimator $\psi_j:\; \boldsymbol{D} \to [0,1]$ of $\mathbf{1}\{j \in \mathcal{H}_0\}$, the estimator $\widehat{\textnormal{FDR}}$ ...

Figures (15)

Figure 1: Cross-validation MSE (blue), true FDR (black), and our FDR estimator (red), for Lasso regression in two scenarios with explanatory variables that are (a) independent, and (b) highly correlated. In scenario (a), the minimum-CV model (solid blue vertical line) has high FDR, while the one-standard-error rule (dashed vertical line) achieves a reasonably low FDR. In scenario (b), there is no model that simultaneously achieves good predictive performance and low FDR. In both scenarios, our FDR estimator successfully captures important information about variable selection performance that is not evident from the CV curve.
Figure 2: $\widehat{\textnormal{FDR}}$ along with estimated one-standard-error bars by bootstrap, for four representative experiments selected from among the 16 HIV experiments in rhee2006genotypic. The vertical lines show the locations of the complexity parameters selected by CV and the one-standard-error rule.
Figure 3: Causal DAG supplied by sachs2005causal and its induced undirected graph, which we use for independent validation of our FDR estimator. A pair of proteins has an edge in the undirected graph if the two proteins are not conditionally independent given the other proteins. There are $22$ edges in the induced undirected graph, out of $55$ possible edges.
Figure 4: $\widehat{\textnormal{FDR}}$ along with estimated one-standard-error bars by bootstrap. The $\widehat{\textnormal{FDP}}$ curves are calculated with a "ground truth" based on the dependence graph in Figure 3. Cross-validation alone reveals little about the right way to select variables (pairs), while our $\widehat{\textnormal{FDR}}$ is informative.
Figure 5: Performance of $\widehat{\textnormal{FDR}}$ and $\widehat{\textnormal{s.e.}}$ in the Gaussian linear model. The error bars show the $5\%$ to $95\%$ quantiles of the empirical distributions and the solid lines show the sample average. $\widehat{\textnormal{FDR}}$ shows a small non-negative bias in estimating FDR and $\widehat{\textnormal{s.e.}}(\widehat{\textnormal{FDR}})$ successfully captures the magnitude of uncertainty of $\widehat{\textnormal{FDR}}$.
...and 10 more figures
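
The validation described in the captions for Figures 3 and 4 compares selected edges against a reference undirected graph. The sketch below shows how a false discovery proportion could be computed for edge selection under that setup. The node names `P1` to `P11` are placeholders rather than the actual protein names, the 11-node count is inferred from the stated 55 possible edges (since $\binom{11}{2} = 55$), and the toy `truth` and `selected` sets are invented for illustration.

```python
from itertools import combinations


def edge(a, b):
    """Represent an undirected edge so that (a, b) and (b, a) coincide."""
    return frozenset((a, b))


def fdp(selected, true_edges):
    """False discovery proportion: share of selected edges that are not in
    the reference graph, with the convention that an empty selection gives 0."""
    selected = set(selected)
    if not selected:
        return 0.0
    false_discoveries = selected - set(true_edges)
    return len(false_discoveries) / len(selected)


# 11 placeholder nodes give C(11, 2) = 55 possible undirected edges.
nodes = [f"P{i}" for i in range(1, 12)]
possible = {edge(a, b) for a, b in combinations(nodes, 2)}

# Toy reference graph and toy selection (not the graph from the figure).
truth = {edge("P1", "P2"), edge("P2", "P3"), edge("P3", "P4")}
selected = {edge("P1", "P2"), edge("P3", "P2"), edge("P1", "P4"), edge("P5", "P6")}

print(len(possible))         # number of candidate edges
print(fdp(selected, truth))  # 2 of the 4 selected edges are not in truth
```

Storing edges as `frozenset` pairs avoids counting the same undirected edge twice when the endpoints are listed in a different order.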

Theorems & Definitions (64)

Theorem 2.1
proof
Remark 2.1
Remark 2.2
Remark 2.3
Remark 2.4
Remark 2.5
Example 2.1: Nonparametric "model-X" setting
Example 2.2: Gaussian graphical model
Proposition 3.1
...and 54 more
