Variable selection via thresholding
Ka Long Keith Ho, Hien Duy Nguyen
TL;DR
This work develops a thresholding-based framework for variable selection in linear regression that converts any convergent estimator into a consistent selector of the relevant variable set. By pairing a multi-thresholding transform with an information-criterion-like penalty (SWIC), the authors identify an index $k_0$ that yields consistent exclusion of irrelevant variables, and they produce a hard-thresholded estimator with $\sqrt{n}$-consistency and oracle-type properties for the relevant coordinates. The theoretical results rely on empirical-process techniques to establish weak convergence of the thresholded empirical risk, and on a delta-method argument to carry the asymptotics over to the thresholded estimator. Numerical experiments with OLS, ridge, and adaptive ridge estimators as inputs, together with an application to a prostate cancer dataset, demonstrate effective variable screening and sparse estimation, with performance improving as the sample size grows and the penalty is tuned.
Abstract
Variable selection is an important step in many modern statistical inference procedures. In the regression setting, when estimators cannot shrink irrelevant signals to zero, covariates unrelated to the response often receive small but non-zero estimated regression coefficients. In practice, the ad hoc procedure of discarding variables whose coefficients are smaller than some threshold is often employed. We formally analyze a version of such thresholding procedures and develop a simple thresholding method that consistently estimates the set of relevant variables under mild regularity assumptions. Using this thresholding procedure, we propose a sparse, $\sqrt{n}$-consistent and asymptotically normal estimator whose non-zero elements do not exhibit shrinkage. The performance and applicability of our approach are examined via numerical studies of simulated and real data.
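To make the idea of thresholding with a penalized criterion concrete, the following is a minimal Python sketch and not the authors' procedure. It ranks the coefficients of an initial estimate by magnitude, evaluates the empirical squared-error risk of each hard-thresholded candidate that keeps the $k$ largest coefficients, and picks the $k$ minimizing risk plus a penalty. The function name `threshold_select`, the linear-in-$k$ penalty, and its default rate $\log(n)/n$ are assumptions made for illustration; the exact form of SWIC is not stated here.

```python
import numpy as np


def threshold_select(X, y, beta_hat, penalty=None):
    """Hard-threshold beta_hat by keeping its k largest-magnitude entries,
    with k chosen to minimize (empirical risk + k * penalty).

    Hypothetical helper for illustration; the penalty form is an assumption.
    """
    n, p = X.shape
    if penalty is None:
        # Assumed BIC-like rate, not taken from the paper.
        penalty = np.log(n) / n

    # Coefficient indices, largest magnitude first.
    order = np.argsort(-np.abs(beta_hat))

    best_k, best_score = 0, np.inf
    for k in range(p + 1):
        # Candidate estimator: keep the top-k coefficients, zero the rest.
        beta_k = np.zeros(p)
        beta_k[order[:k]] = beta_hat[order[:k]]
        risk = np.mean((y - X @ beta_k) ** 2)
        score = risk + k * penalty
        if score < best_score:
            best_k, best_score = k, score

    selected = np.sort(order[:best_k])
    beta_sparse = np.zeros(p)
    # Retained coefficients are copied unchanged (no shrinkage).
    beta_sparse[selected] = beta_hat[selected]
    return selected, beta_sparse


if __name__ == "__main__":
    # Simulated example: three relevant and three irrelevant covariates.
    rng = np.random.default_rng(0)
    n, beta_true = 500, np.array([2.0, 0.0, -1.5, 0.0, 0.0, 1.0])
    X = rng.standard_normal((n, beta_true.size))
    y = X @ beta_true + rng.standard_normal(n)
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # initial OLS estimate
    selected, beta_sparse = threshold_select(X, y, beta_ols)
    print("selected indices:", selected)
    print("sparse estimate:", beta_sparse)
```

Any convergent initial estimate (for instance ridge instead of OLS) could be passed as `beta_hat`. The retained coefficients are left at their initial values, which mirrors the absence of shrinkage in the non-zero elements described above.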
