Empirical Bayes Variable Selection with Lasso Statistics in the AMP Framework

Lina Hidmi, Asaf Weinstein

TL;DR

This work addresses variable selection in high-dimensional Gaussian linear models by reframing the problem through the approximate message passing (AMP) framework and a two-groups model, enabling an empirical Bayes (EB) procedure that orders variables by an estimated local false discovery rate (lfdr). The authors prove that, for fixed $\lambda$, the EB method achieves the same asymptotic tradeoff between false discovery proportion (FDP) and true positive proportion (TPP) as an oracle lfdr-based rule, and they show that the optimal $\lambda$ coincides with the minimizer of the asymptotic mean squared error, justifying cross-validated selection of $\lambda$. Theoretical results are complemented by simulations demonstrating substantial power gains over Lasso and thresholded-Lasso selection while maintaining asymptotic error control, with a practical lfdr estimation framework that extends beyond pure power considerations. Overall, the paper contributes an instance-optimal EB approach within AMP for variable selection and provides concrete guidance on density-estimation-based lfdr implementation and parameter tuning for improved high-dimensional inference.

Abstract

The Lasso is one of the most ubiquitous methods for variable selection in high-dimensional linear regression and has been studied extensively under different regimes. In a particular asymptotic setup entailing $n/p\to \text{constant}$, an i.i.d. Gaussian $X$ matrix and linear sparsity, Su et al. (2017) analyzed the Lasso selection path and presented negative results, showing that maintaining small levels of the false discovery proportion comes at a substantial cost in power. Follow-up work by Wang et al. (2020) used the same framework to study the tradeoff between type I error and power for thresholded-Lasso selection, which ranks the variables based on the magnitude of the Lasso estimate instead of the order of appearance on the Lasso path, and demonstrated that significant improvements are possible if the regularization parameter is chosen appropriately. We take this line of research a step further, seeking an optimal selection procedure in the AMP framework among procedures that order the variables by some univariate function of the Lasso estimate at a fixed value $\lambda$ of the regularization term. Observing that the model for the Lasso estimates effectively reduces asymptotically to a version of the well-studied two-groups model, we propose an empirical Bayes variable selection procedure based on an estimate of the local false discovery rate. We extend existing results in the AMP framework to obtain exact predictions for the curve describing the asymptotic tradeoff between type I error and power of this procedure. Additionally, we prove that the optimal $\lambda$ is the minimizer of the asymptotic mean squared error, and accordingly propose to use the empirical Bayes procedure with $\lambda$ estimated by cross-validation. The theoretical predictions imply that the gains in power can be substantial, and we confirm this by numerical studies under different settings.
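
To make the idea of ranking by a local false discovery rate concrete, the following is a minimal sketch of an oracle lfdr rule in a plain Gaussian two-groups model. It is an illustration under stated assumptions, not the paper's procedure: the paper works with Lasso estimates and an estimated lfdr, whereas here the statistics are unthresholded, the mixture parameters are treated as known, and the stopping rule (keep the largest prefix whose average lfdr stays below a level `q`) is a standard choice assumed for the example. The function names and parameter values are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def lfdr_two_groups(x, eps, tau, mu, s):
    """P(null | x) when x ~ (1 - eps) N(0, tau^2) + eps N(mu, tau^2 + s^2)."""
    f0 = (1.0 - eps) * normal_pdf(x, 0.0, tau)
    f1 = eps * normal_pdf(x, mu, math.sqrt(tau ** 2 + s ** 2))
    return f0 / (f0 + f1)

def select_by_lfdr(stats, lfdr_fn, q):
    """Rank by lfdr (smallest first); keep the largest prefix with mean lfdr <= q."""
    scored = sorted((lfdr_fn(x), i) for i, x in enumerate(stats))
    selected, running = [], 0.0
    for k, (value, i) in enumerate(scored, start=1):
        running += value
        if running / k > q:
            break  # the running mean is nondecreasing, so later prefixes also fail
        selected.append(i)
    return selected

# Hypothetical example: five statistics, two of which look like signals.
stats = [0.1, 5.0, -0.2, 4.0, 0.0]
oracle = lambda x: lfdr_two_groups(x, eps=0.1, tau=1.0, mu=3.5, s=1.0)
selected = select_by_lfdr(stats, oracle, q=0.1)
print(sorted(selected))
```

Because the lfdr values are sorted in increasing order, the running average can only grow, which is why stopping at the first exceedance returns the largest admissible prefix.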

Paper Structure (8 sections, 18 theorems, 191 equations, 5 figures)

This paper contains 8 sections, 18 theorems, 191 equations, 5 figures.

Key Result

Theorem 1

Under the asymptotic setup described above, let $\hat{\beta} = \hat{\beta}(\lambda)$ denote the Lasso estimator. Let $\psi: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be a pseudo-Lipschitz function. Then as $p\to \infty$, where $Z \sim \mathcal{N}(0,1)$ is independent of $\Pi$, and the parameters $\tau > 0$ and $\alpha > \alpha_{\min}$ are the unique solution to the system of
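
For orientation, results of this type in Bayati and Montanari's Lasso analysis are usually written as $\frac{1}{p}\sum_{i=1}^p \psi(\hat{\beta}_i, \beta_i) \to \mathbb{E}\,\psi(\eta_{\alpha\tau}(\Pi + \tau Z), \Pi)$ almost surely, where $\eta_\theta$ is the soft-thresholding operator, with $(\tau, \alpha)$ solving $\tau^2 = \sigma^2 + \frac{1}{\delta}\mathbb{E}\,(\eta_{\alpha\tau}(\Pi+\tau Z) - \Pi)^2$ and $\lambda = \big(1 - \frac{1}{\delta}\mathbb{P}(|\Pi+\tau Z| > \alpha\tau)\big)\alpha\tau$. This form is an assumption taken from that literature, not from the truncated statement above, and the paper's notation or scaling may differ. Under that assumption, the sketch below solves the first equation for $\tau$ by fixed-point iteration at a given $\alpha$ and then reads off the corresponding $\lambda$. The two-point prior $\Pi = (1-\epsilon)\delta_0 + \epsilon\delta_M$, the helper names, and the parameter values are all hypothetical.

```python
import math

def soft_threshold(x, theta):
    """Soft-thresholding operator eta_theta(x)."""
    return math.copysign(max(abs(x) - theta, 0.0), x)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gauss_expect(g, n_grid=1001, lim=8.0):
    """E[g(Z)] for Z ~ N(0, 1), by the trapezoid rule on [-lim, lim]."""
    h = 2.0 * lim / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        z = -lim + k * h
        w = 0.5 if k in (0, n_grid - 1) else 1.0
        total += w * g(z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

def mse_map(tau, alpha, eps, M):
    """E[(eta_{alpha*tau}(Pi + tau*Z) - Pi)^2] for Pi = (1-eps)*delta_0 + eps*delta_M."""
    theta = alpha * tau
    null = gauss_expect(lambda z: soft_threshold(tau * z, theta) ** 2)
    signal = gauss_expect(lambda z: (soft_threshold(M + tau * z, theta) - M) ** 2)
    return (1.0 - eps) * null + eps * signal

def solve_tau(alpha, eps, M, delta, sigma, n_iter=100):
    """Iterate tau^2 <- sigma^2 + mse_map(tau) / delta from an arbitrary start."""
    tau2 = sigma ** 2 + 1.0
    for _ in range(n_iter):
        tau2 = sigma ** 2 + mse_map(math.sqrt(tau2), alpha, eps, M) / delta
    return math.sqrt(tau2)

def lambda_from_alpha(alpha, tau, eps, M, delta):
    """lambda = (1 - P(|Pi + tau*Z| > alpha*tau) / delta) * alpha * tau."""
    theta = alpha * tau
    p_null = 2.0 * normal_cdf(-alpha)
    p_signal = normal_cdf((M - theta) / tau) + normal_cdf(-(M + theta) / tau)
    p_active = (1.0 - eps) * p_null + eps * p_signal
    return (1.0 - p_active / delta) * theta

# Hypothetical setting: eps = 0.1, M = 4, delta = 1, sigma = 1, alpha = 1.5.
tau = solve_tau(alpha=1.5, eps=0.1, M=4.0, delta=1.0, sigma=1.0)
lam = lambda_from_alpha(alpha=1.5, tau=tau, eps=0.1, M=4.0, delta=1.0)
print(tau, lam)
```

Parameterizing by $\alpha$ and computing $\lambda$ afterwards avoids a two-dimensional root search; to target a specific $\lambda$, one could wrap these two functions in a one-dimensional search over $\alpha$.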

Figures (5)

  • Figure 1: Tradeoff curves for the proposed EB method ("EB-Lasso" in the legend) compared to Lasso and thresholded-Lasso selection, for an example with $p=10^4, n=2p$, and $\beta_i\sim 0.9\delta_0 + 0.1\mathcal{N}(3.5, 1)$. Black curves are theoretical asymptotic predictions, thin grey lines are corresponding simulation results.
  • Figure 2: Empirical tradeoff curves vs. theoretical predictions for the oracle and the EB procedure. Left column displays results for $\delta=0.5$, right column for $\delta=1.8$. The rows correspond to different choices of $\Pi_1$. Thin lines are realized FDP and TPP from 17 simulation runs, red for the oracle and blue for the EB procedure. Solid black lines represent theoretical predictions $\text{fdp}^*(t), \text{tpp}^*(t)$. Broken black line is the theoretical curve for thresholded-Lasso. Further details are included in the main text.
  • Figure 3: Left: asymptotic FDP vs. $\lambda$ when fixing the asymptotic TPP at $0.7$, for a setting with $\epsilon = 0.1, \delta = 1, \sigma =1$ and $\Pi_1 = .2\mathcal{N}(-3.6, 1) + .8\mathcal{N}(4,1)$. Right: simulation mean and standard error of $\hat{\lambda}_\text{cv}$ for different $p$.
  • Figure 4: Asymptotic FDP of the EB procedure when using $\lambda^*_{cv}$ vs. the optimal value $\lambda^*$ for $\delta=1$ and four different choices of $\Pi_1$. In each panel, every point corresponds to some fixed value $\xi$ of the (asymptotic) TPP level. The identity line is shown for reference.
  • Figure 5: Estimation of the local FDR in an example with $\epsilon = 0.1, \delta = 1, \sigma =1$, $\Pi_1 = .2\mathcal{N}(-3.6, 1) + .8\mathcal{N}(4,1)$, and $\lambda=1$ used in calculating the Lasso estimates. Solid curves are theoretical asymptotic predictions of the true lfdr and its estimate. Grey lines represent realized FDP in a small moving window from 17 independent simulation runs with $n = p = 14000$, as described in the main text.
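
The captions above refer to realized FDP and TPP along a ranking of the variables. As a minimal sketch of how such an empirical tradeoff curve can be traced, the code below selects the top-k variables for k = 1, ..., p and records (TPP, FDP) at each step. The function name, the scores, and the toy labels are hypothetical. Any ranking statistic can be plugged in, for example the magnitude of the Lasso estimate for thresholded-Lasso selection or one minus an lfdr estimate for an lfdr-based rule.

```python
def tradeoff_curve(scores, is_signal):
    """Realized (TPP, FDP) pairs when selecting the top-k variables by decreasing score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_signal = max(sum(is_signal), 1)  # guard against division by zero
    curve, true_pos = [], 0
    for k, i in enumerate(order, start=1):
        if is_signal[i]:
            true_pos += 1
        tpp = true_pos / n_signal      # share of true signals selected so far
        fdp = (k - true_pos) / k       # share of selections that are nulls
        curve.append((tpp, fdp))
    return curve

# Hypothetical toy example with four variables, two of them true signals.
scores = [3.0, 0.5, 2.0, 0.1]
is_signal = [True, False, False, True]
curve = tradeoff_curve(scores, is_signal)
print(curve)
```

Accumulating the true-positive count while walking down the ranking keeps the cost at one sort plus a single pass, which matters at the problem sizes quoted in the captions.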

Theorems & Definitions (35)

  • Theorem: Theorem 1.5 of Bayati and Montanari (2012)
  • Lemma 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Lemma 4.4
  • Proposition 4.5
  • Proposition 4.6
  • Remark 4.7
  • Theorem 4.8
  • Remark 4.9
  • ...and 25 more