Thresholded Lasso for high dimensional variable selection

Shuheng Zhou

TL;DR

This work develops and analyzes the Thresholded Lasso for high-dimensional linear regression when $n \ll p$. The method combines an initial Lasso fit with a data-driven thresholding step and an OLS refit on the selected indices, achieving sparse oracle-type $\ell_2$ loss under Restricted Eigenvalue and related sparse-eigenvalue conditions, without requiring a strong $\beta_{\min}$ assumption. It also extends the analysis to the Gauss-Dantzig selector under a Uniform Uncertainty Principle, and provides proof sketches and numerical results on support recovery and $\ell_2$ error. The results give a framework for simultaneous variable selection and estimation in high dimensions, with explicit error bounds that adapt to the underlying sparsity pattern. Overall, the Thresholded Lasso is a theoretically justified, implementable approach that stays within a logarithmic factor of oracle performance while keeping the selected model compact, even when many weak signals are present.
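The three steps summarized above (initial Lasso fit, thresholding, OLS refit) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the coordinate-descent solver, the penalty `2 * lam * sigma`, the threshold `t0 = lam * sigma`, the synthetic data, and all function names are choices made for this example only.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, penalty, n_iter=100):
    """Coordinate descent for (1/(2n)) * ||y - X b||_2^2 + penalty * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n   # equals 1 when columns have norm sqrt(n)
    r = y.copy()                        # running residual y - X b
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]         # remove coordinate j from the fit
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, penalty) / col_sq[j]
            r -= X[:, j] * b[j]         # add the updated coordinate back
    return b

def thresholded_lasso(X, y, penalty, t0):
    """Step 1: Lasso fit. Step 2: keep |b_init| >= t0. Step 3: OLS refit."""
    b_init = lasso_cd(X, y, penalty)
    I = np.flatnonzero(np.abs(b_init) >= t0)
    b_hat = np.zeros(X.shape[1])
    if I.size > 0:
        coef = np.linalg.lstsq(X[:, I], y, rcond=None)[0]
        b_hat[I] = coef
    return b_hat, I

# Synthetic example (assumed setup): Gaussian design, columns scaled to norm sqrt(n).
rng = np.random.default_rng(0)
n, p, s, sigma = 100, 150, 5, 0.5
X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)
support = np.arange(s)
beta = np.zeros(p)
beta[support] = 3.0                     # strong signals, well above the noise level
y = X @ beta + sigma * rng.standard_normal(n)

lam = np.sqrt(2.0 * np.log(p) / n)
b_hat, I = thresholded_lasso(X, y, penalty=2.0 * lam * sigma, t0=lam * sigma)
```

In this sketch the refit removes the shrinkage bias that the Lasso penalty introduces on the selected coordinates, which is the role the OLS step plays in the description above; the specific constants in front of $\lambda \sigma$ are placeholders.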

Abstract

Given $n$ noisy samples with $p$ dimensions, where $n \ll p$, we show that the multi-step thresholding procedure based on the Lasso, which we call the Thresholded Lasso, can accurately estimate a sparse vector $\beta \in \mathbb{R}^p$ in a linear model $Y = X \beta + \epsilon$, where $X_{n \times p}$ is a design matrix normalized to have column $\ell_2$-norm $\sqrt{n}$, and $\epsilon \sim N(0, \sigma^2 I_n)$. We show that under the restricted eigenvalue (RE) condition, it is possible to achieve the $\ell_2$ loss within a logarithmic factor of the ideal mean square error one would achieve with an oracle while selecting a sufficiently sparse model, hence achieving sparse oracle inequalities; the oracle would supply perfect information about which coordinates are non-zero and which are above the noise level. We also show for the Gauss-Dantzig selector (Candès-Tao 07) that, if $X$ obeys a uniform uncertainty principle, one will achieve the sparse oracle inequalities as above, while allowing at most $s_0$ irrelevant variables in the model in the worst case, where $s_0 \leq s$ is the smallest integer such that for $\lambda = \sqrt{2 \log p/n}$, $\sum_{i=1}^p \min(\beta_i^2, \lambda^2 \sigma^2) \leq s_0 \lambda^2 \sigma^2$. Our simulation results on the Thresholded Lasso closely match our theoretical analysis.
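To make the definition of $s_0$ concrete, the following pure-Python sketch evaluates it for a small hypothetical coefficient vector. The function name `effective_sparsity` and the example values of $n$, $p$, $\sigma$, and $\beta$ are assumptions for illustration; only the formula itself comes from the abstract.

```python
import math

def effective_sparsity(beta, n, sigma):
    """Smallest integer s0 with sum_i min(beta_i^2, lam^2 sigma^2) <= s0 * lam^2 sigma^2,
    where lam = sqrt(2 log p / n) and p = len(beta)."""
    p = len(beta)
    lam_sq = 2.0 * math.log(p) / n
    level_sq = lam_sq * sigma ** 2
    # Dividing each term by the noise level caps strong signals at exactly 1.0,
    # which avoids rounding trouble before taking the ceiling.
    total = sum(min(b * b / level_sq, 1.0) for b in beta)
    return math.ceil(total)

# Hypothetical example: 3 strong and 4 weak coordinates, so s = 7.
n, p, sigma = 200, 1000, 1.0
beta = [1.0] * 3 + [0.1] * 4 + [0.0] * (p - 7)
s = sum(1 for b in beta if b != 0.0)
s0 = effective_sparsity(beta, n, sigma)

# Exactly sparse case: every non-zero coordinate is above the noise level.
beta_exact = [1.0] * 3 + [0.0] * (p - 3)
s0_exact = effective_sparsity(beta_exact, n, sigma)
```

Here $\lambda \sigma \approx 0.263$, so the three coordinates of magnitude 1 each contribute a full unit while the four coordinates of magnitude 0.1 contribute about 0.145 each; the sum is about 3.58, giving $s_0 = 4 < s = 7$. In the exactly sparse vector, $s_0 = s = 3$.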

Paper Structure

This paper contains 41 sections, 20 theorems, 212 equations, 8 figures, and 4 tables.

Introduction
Sparse oracle inequalities
The Thresholded Lasso estimator
The thresholding rules
Discussions
Background and related work
Proof sketch for the main result
Proof sketch of Theorems \ref{thm::RE-oracle-main} and \ref{thm::RE-oracle}
On Type II errors and $\ell_2$-loss optimality
Discussions
Model specification
Conditions in ZH08
Variable selection in $A_0$
Numerical results
$\ell_1$ and $\ell_2$ error bounds for $\beta_{\text{init}}$
...and 26 more sections

Key Result

Theorem 2.1

(Ideal model selection for the Thresholded Lasso) Suppose $\beta \in \mathbb R^p$ is $s$-sparse. Let $s_0$ be as in \ref{eq::define-s0}. Let $Y = X \beta + \epsilon$, where $\epsilon =(\epsilon_1, \ldots, \epsilon_n)^T$ is a vector containing independent and identically distributed (i.i.d.) noise with ...

Figures (8)

Figure 1: In this model, the component $\beta^{(11)}$ has $a_0$ non-zero coordinates with the same magnitude $C_a \lambda \sigma =:\beta_{\min, A_0}$, where $C_a \in \{1.706, 8.528\}$ and $\beta_{\min, A_0} \in \{0.2, 1\}$; the component $\beta^{(12)}$ has $s_0 - a_0$ non-zero coordinates with the same magnitude $C_m \lambda \sigma$, where $C_m = 1/\sqrt{2}$ for $s > s_0$ and $C_m=1$ in case $s_0 = s$; the component $\beta^{(2)}$ has $s - s_0$ non-zero coordinates with the same magnitude $C_t \lambda \sigma =: c_t \sigma/\sqrt{n}$. See \ref{eq::tailcount}. The rest are all 0s. In the exact sparse case, namely, when $s=s_0$, all non-zero signals are concentrated on the component $\beta^{(1)}$ without spreading across components of $\beta^{(2)}$.
Figure 2: $p=2048, n=1600$. Left column: $\left\lVert h_{T_0^c}\right\rVert_1$, $\left\lVert h_{T_0}\right\rVert_1$, and $\left\lVert h\right\rVert_1$ as Lasso penalty ($f_p$) increases across different sparsity $s \in \{130, 370, 511\}$. Right column: plots of $\left\lVert h_{T_0}\right\rVert_2$ and $\left\lVert\delta\right\rVert_2$. In the top panel, we fix $\gamma = 0.3$, and compare two cases of $C_a \lambda \sigma \in \{0.2, 1 \}$. In the middle panel, we fix $C_a \lambda \sigma = 1$ and compare two cases of $\gamma \in \{0.3, 0.7\}$. In the bottom panel, we zoom in on one case with $\gamma=0.7, C_a \lambda \sigma = 0.2$, and we plot $\left\lVert\delta\right\rVert_1$ together with $\left\lVert\delta\right\rVert_2$ in the bottom right panel.
Figure 3: $p=2048, n=1600, \gamma=0.7$. Plots of model size ($|I|$), number of TPs and FPs, as threshold increases. Note $|I|=$ TPs + FPs. In (a) and (b), Lasso penalty factor $f_p=0.3$ is fixed, and in panel (a) $s \in \{130, 511, 710\}$, and in panel (b) $s \in \{50, 130\}$. In panels (c) and (d), we plot the same metrics across different $f_p \in \{0.1, 0.3, 0.7\}$ with fixed $s=130$. In all panels, the 3 dotted vertical lines from left to right represent $C_m \lambda \sigma / 2, C_m \lambda\sigma$ and $\lambda\sigma$. The model size remains invariant and hence the diagonal dashed lines all stay flat for $\lambda\sigma < t_0 \le 2 \lambda \sigma$ for $\beta_{\min, A_0}=1$.
Figure 4: $p=2048, n=1600$, $s =130$. Plots of $\left\lVert\hat{\beta}^{\text{ols}}(I) - \beta\right\rVert_2$, for $C_a \lambda \sigma \in \{0.2, 1\}$ and $\gamma \in \{0.3, 0.7\}$. The horizontal lines correspond to the $\ell_2$-norm error of the Lasso estimate $\beta_{\text{init}}$, namely, $\left\lVert\delta\right\rVert_2$.
Figure 5: Illustrative example: i.i.d. Gaussian ensemble; $p=256$, $n=72$, $s=8$, and $\sigma = \sqrt{s}/3$. (a) Compare with the Lasso estimator $\tilde{\beta}$ which minimizes $\ell_2$ loss. Here $\tilde{\beta}$ has only 3 FPs, but $\rho^2$ is large with a value of $64.73$. (b) Compare with the $\beta_{\text{init}}$ obtained using $\lambda_n$. The dotted lines show the thresholding level $t_0$. The $\beta_{\text{init}}$ has 15 FPs, all of which were cut after the 2nd step, resulting in $\rho^2= 12.73$. After refitting with OLS in the 3rd step, $\rho^2$ for $\hat{\beta}$ is further reduced to $0.51$.
...and 3 more figures

Theorems & Definitions (31)

Theorem 2.1
Remark 2.2
Lemma 2.3
Theorem 2.4
Lemma 2.5
Remark 2.6
Lemma 2.7
Lemma 2.8
Remark 2.9
Remark 2.10
...and 21 more
