Thresholded Lasso for high dimensional variable selection
Shuheng Zhou
TL;DR
This work develops and analyzes the Thresholded Lasso for high-dimensional linear regression when $n \ll p$. The method combines an initial Lasso fit with a data-driven thresholding step and an OLS refit on the selected indices. Under restricted eigenvalue (RE) and related sparse-eigenvalue conditions, it achieves sparse oracle-type $\ell_2$ loss without requiring a strong $\beta_{\min}$ assumption. The work also extends the Gauss-Dantzig selector under a uniform uncertainty principle, and it provides proof sketches together with numerical experiments showing near-optimal support recovery and favorable error rates. The resulting error bounds are explicit and adapt to the underlying sparsity pattern, so the estimator closely matches oracle performance while keeping the selected model compact, even when many weak signals are present.
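The three steps named above (initial Lasso fit, thresholding, OLS refit) can be sketched as follows. This is a minimal illustration and not the paper's implementation: the coordinate-descent solver `lasso_cd`, the objective scaling $\frac{1}{2n}\|y - Xb\|_2^2 + \lambda\|b\|_1$, the factor 2 in the penalty level, and the choice of threshold equal to the penalty are all assumptions made for the example.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n))*||y - X b||^2 + lam*||b||_1 (assumed scaling)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()  # residual y - X @ beta, kept up to date in place
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the partial residual (column j excluded).
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def thresholded_lasso(X, y, lam, thresh):
    """Step 1: Lasso. Step 2: keep |coef| >= thresh. Step 3: OLS on kept columns."""
    beta_init = lasso_cd(X, y, lam)
    support = np.flatnonzero(np.abs(beta_init) >= thresh)
    beta_hat = np.zeros(X.shape[1])
    if support.size > 0:
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta_hat[support] = coef
    return beta_hat, support

# Synthetic example (all sizes and signal strengths are hypothetical).
rng = np.random.default_rng(0)
n, p, s, sigma = 100, 200, 5, 1.0
X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)  # columns have l2-norm sqrt(n)
beta = np.zeros(p)
beta[:s] = 3.0
y = X @ beta + sigma * rng.standard_normal(n)

lam = 2 * sigma * np.sqrt(2 * np.log(p) / n)  # illustrative penalty level
beta_hat, support = thresholded_lasso(X, y, lam, thresh=lam)
print("selected indices:", support)
print("l2 error:", np.linalg.norm(beta_hat - beta))
```

The refit removes the shrinkage bias of the initial Lasso estimate on the retained coordinates, which is why the final loss can track that of least squares on a small, well-chosen model.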
Abstract
Given $n$ noisy samples in $p$ dimensions, where $n \ll p$, we show that a multi-step thresholding procedure based on the Lasso, which we call the {\it Thresholded Lasso}, can accurately estimate a sparse vector $β \in {\mathbb R}^p$ in a linear model $Y = X β + ε$, where $X_{n \times p}$ is a design matrix normalized to have column $\ell_2$-norm $\sqrt{n}$, and $ε \sim N(0, σ^2 I_n)$. We show that under the restricted eigenvalue (RE) condition, it is possible to achieve $\ell_2$ loss within a logarithmic factor of the ideal mean square error one would achieve with an {\it oracle}, while selecting a sufficiently sparse model, thereby achieving {\it sparse oracle inequalities}. Here the oracle supplies perfect information about which coordinates are non-zero and which are above the noise level. We also show that for the Gauss-Dantzig selector (Candès and Tao, 2007), if $X$ obeys a uniform uncertainty principle, one achieves the same sparse oracle inequalities while allowing at most $s_0$ irrelevant variables in the model in the worst case, where $s_0 \leq s$ is the smallest integer such that, for $λ = \sqrt{2 \log p/n}$, $\sum_{i=1}^p \min(β_i^2, λ^2 σ^2) \leq s_0 λ^2 σ^2$. Our simulation results for the Thresholded Lasso agree closely with the theoretical analysis.
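To make the definition of $s_0$ concrete, the short sketch below evaluates it for a hypothetical coefficient vector. The function name `effective_sparsity`, the numerical tolerance, and the example values are assumptions for illustration only; the formula itself is the one stated in the abstract.

```python
import math

def effective_sparsity(beta, n, p, sigma):
    """Smallest integer s0 with sum_i min(beta_i^2, lam^2 sigma^2) <= s0 * lam^2 sigma^2,
    where lam = sqrt(2 log p / n)."""
    lam = math.sqrt(2.0 * math.log(p) / n)
    unit = (lam * sigma) ** 2  # squared noise level lam^2 * sigma^2
    total = sum(min(b * b, unit) for b in beta)
    # Small tolerance guards against floating-point round-off at exact integers.
    return max(0, math.ceil(total / unit - 1e-9))

# Hypothetical example: 3 strong coordinates and 10 weak ones, so s = 13.
n, p, sigma = 200, 1000, 1.0
beta = [1.0] * 3 + [0.05] * 10 + [0.0] * (p - 13)
s0 = effective_sparsity(beta, n, p, sigma)
s = sum(1 for b in beta if b != 0.0)
print("s =", s, " s0 =", s0)
```

Each coordinate above the noise level $λσ$ contributes a full unit to the sum, while weak coordinates contribute only $β_i^2$, so $s_0$ can be much smaller than $s$ when many signals are weak.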
