Estimating the False Discovery Rate of Variable Selection
Yixiang Luo, William Fithian, Lihua Lei
TL;DR
This work presents a general-purpose estimator for the false discovery rate (FDR) of any variable selection procedure, valid in Gaussian linear models, Gaussian graphical models, and model-X settings. The estimator decomposes the FDR into per-variable terms, uses Rao-Blackwellization to obtain components that are unbiased under the null, and applies a p-value-based normalization to produce a conservative overall estimate. It is complemented by a bootstrap procedure for estimating the standard error, and supported by theoretical variance bounds and asymptotic bootstrap validity in a Gaussian linear regime with block-orthogonal design. Real-data applications (HIV drug resistance and protein signaling) and comprehensive simulations show that the estimator gives actionable insight into the trade-off between predictive accuracy and variable-selection accuracy, often revealing more than cross-validation alone conveys. The accompanying software (R package hFDR) lets practitioners apply the method across diverse problems and model choices, with interpretable FDR guidance along the model complexity path.
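The per-variable decomposition mentioned above can be illustrated with a minimal sketch. It shows only the standard identity that the false discovery proportion (FDP) of a selection set equals the sum, over variables, of the indicator "selected and null" divided by the number of selections; the FDR is the expectation of that sum. It does not implement the paper's estimator, the Rao-Blackwellization step, or the hFDR package API. The function name `fdp_contributions` and the toy inputs are hypothetical, and the truth vector `is_null` is assumed known here, which is only the case in simulations.

```python
def fdp_contributions(selected, is_null):
    """Per-variable contributions to the false discovery proportion.

    selected[j] is True if variable j was selected by the procedure.
    is_null[j] is True if variable j is truly null (known only in simulation).
    Contribution j is 1{j selected and j null} / max(#selected, 1).
    """
    n_selected = max(sum(selected), 1)  # guard against an empty selection set
    return [float(s and h) / n_selected for s, h in zip(selected, is_null)]


# Toy example (hypothetical data): variables 0, 1, 3 are selected,
# and variables 1, 2, 3 are truly null, so 2 of the 3 selections are false.
selected = [True, True, False, True, False]
is_null = [False, True, True, True, False]

contributions = fdp_contributions(selected, is_null)
fdp = sum(contributions)  # equals (#false selections) / (#selections)
print(contributions, fdp)
```

Averaging each contribution over repeated draws of the data would give the per-variable terms whose sum is the FDR; the estimator described above targets those terms without access to `is_null`.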
Abstract
We introduce a generic estimator for the false discovery rate of any model selection procedure, in common statistical modeling settings including the Gaussian linear model, Gaussian graphical model, and model-X setting. We prove that our method has a conservative (non-negative) bias in finite samples under standard statistical assumptions, and provide a bootstrap method for assessing its standard error. For methods like the Lasso, forward-stepwise regression, and the graphical Lasso, our estimator serves as a valuable companion to cross-validation, illuminating the tradeoff between prediction error and variable selection accuracy as a function of the model complexity parameter.
