Controlling the false discovery rate in high-dimensional linear models using model-X knockoffs and $p$-values

Jinyuan Chang, Chenlong Li, Cheng Yong Tang, Zhengtian Zhu

TL;DR

The paper addresses FDR control in high-dimensional linear models by integrating model-X knockoffs with debiased penalized regression to produce valid $p$-values. It develops two paired test-statistic streams $(t_{1,j}, t_{2,j})$ from a debiased augmented model and proves asymptotic normality and independence, enabling both the standard BH procedure and a two-step Bonferroni–BH approach to improve power. The authors establish rigorous FDR control under unknown dependence and demonstrate superior power, particularly in low-signal, small-sample settings, through extensive simulations and a real-data HIV mutation analysis. The methodology relies on CLIME for the precision matrix estimation and scaled Lasso for variance, with a debiased estimator built on the augmented design $Z=(X, \tilde X)$, and is complemented by practical guidance and public code. Collectively, the work offers a principled, scalable framework for reliable variable selection in high-dimensional inference where dependence structures are complex and not fully known.
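
For orientation, the generic debiased (desparsified) Lasso correction on an augmented design can be written as below. This is the textbook form, assumed here for illustration only; the paper's exact estimator, scaling, and tuning may differ. The symbols $\mathbf{y}$ (response vector), $\hat{\boldsymbol{\Omega}}$ (an estimate of the precision matrix of the rows of $Z$, for example from CLIME), and the superscript $\mathrm{db}$ are notation introduced for this sketch.

$$
\hat{\boldsymbol{\gamma}}^{\mathrm{db}} = \hat{\boldsymbol{\gamma}} + \frac{1}{n}\,\hat{\boldsymbol{\Omega}}\, Z^{\top}\bigl(\mathbf{y} - Z\hat{\boldsymbol{\gamma}}\bigr), \qquad Z = (X, \tilde X) \in \mathbb{R}^{n \times 2d},
$$

where $\hat{\boldsymbol{\gamma}}$ is the Lasso estimator fitted on $Z$. The correction term adds back a rescaled residual projection, which is what makes coordinate-wise asymptotic normality, and hence $p$-values, available.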

Abstract

In this paper, we propose novel multiple testing methods for controlling the false discovery rate (FDR) in the context of high-dimensional linear models. Our development innovatively integrates model-X knockoff techniques with debiased penalized regression estimators. The proposed approach addresses two fundamental challenges in high-dimensional statistical inference: (i) constructing valid test statistics and corresponding $p$-values in solving problems with a diverging number of model parameters, and (ii) ensuring FDR control under complex and unknown dependence structures among test statistics. A central contribution of our methodology lies in the rigorous construction and theoretical analysis of two paired sets of test statistics. Based on these test statistics, our methodology adopts two $p$-value-based multiple testing algorithms. The first applies the conventional Benjamini-Hochberg procedure, justified by the asymptotic mutual independence and normality of one set of the test statistics. The second leverages the paired structure of both sets of test statistics to improve detection power while maintaining rigorous FDR control. We provide comprehensive theoretical analysis, establishing the validity of the debiasing framework and ensuring that the proposed methods achieve proper FDR control. Extensive simulation studies demonstrate that our procedures outperform existing approaches - particularly those relying on empirical evaluations of false discovery proportions - in terms of both power and empirical control of the FDR. Notably, our methodology yields substantial improvements in settings characterized by weaker signals, smaller sample sizes, and lower pre-specified FDR levels.
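
The first algorithm described above applies the Benjamini-Hochberg procedure to $p$-values obtained from asymptotically normal test statistics. A minimal sketch of that final step follows, assuming the statistics are approximately standard normal under the null; the function names `two_sided_pvalues` and `benjamini_hochberg` are hypothetical and are not taken from the authors' code.

```python
from math import erfc, sqrt

import numpy as np


def two_sided_pvalues(z):
    """Two-sided p-values for statistics assumed to be N(0, 1) under the null."""
    z = np.asarray(z, dtype=float)
    # P(|N(0, 1)| > |z|) equals erfc(|z| / sqrt(2)).
    return np.array([erfc(abs(v) / sqrt(2.0)) for v in z])


def benjamini_hochberg(pvals, alpha=0.1):
    """Return the sorted indices rejected by the BH step-up procedure."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                       # indices of p-values, ascending
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0])            # largest k with p_(k) <= alpha * k / m
    return np.sort(order[: k + 1])              # reject the k smallest p-values


# Illustrative statistics (made-up numbers, not from the paper).
z_stats = [4.0, 0.3, -3.5, 1.0, -0.2, 2.9]
rejected = benjamini_hochberg(two_sided_pvalues(z_stats), alpha=0.1)
print(rejected.tolist())  # [0, 2, 5]
```

Because BH is a step-up rule, every hypothesis ranked at or below the largest qualifying rank is rejected, even if an individual $p$-value lies above its own threshold.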

Paper Structure

This paper contains 22 sections, 3 theorems, 24 equations, 21 figures, 2 algorithms.

Key Result

Theorem 1

Let Conditions ass:model_error--ass:CLIME hold and $|\boldsymbol{\gamma}_{0}|_{0}\le s_0$ for some integer $1\le s_0<2d$. For any given $\tau>0$ specified in Condition ass:CLIME, let $\hat{\boldsymbol{\gamma}}$ be the Lasso estimator given in eq:lasso with $\varrho_1$ satisfying $\varrho_1 \ge 4\sig (ii) If $n \ge 6(c_{2}+2)\log (2d)$, then

Figures (21)

  • Figure 1: Simulated false discovery rate (FDR) when all null hypotheses are true, for setting 1 (left column), setting 2 (middle column), and setting 3 (right column). The sample sizes for the top and bottom rows are $n = 200$ and $n = 500$, respectively. The FDR level is $\alpha = 0.1$. The methods compared are Algorithm 1 (squares and red solid line), Algorithm 2 (circles and green solid line), the knockoff-based method of Candès et al. (2018) (triangles and blue dotted line), and the Gaussian Mirror method of Xing et al. (2021) with the FDP+ procedure (diamonds and purple dashed line).
  • Figure 2: Simulated FDR and power for the settings of $n = 200$ and $d = 200$ (left column), $n = 200$ and $d = 300$ (middle column), $n = 200$ and $d = 400$ (right column). The rows of the design matrix were generated from setting 2. The sparsity level is $k = 0.04d$ and the FDR level is $\alpha = 0.1$. The methods compared are Algorithm 1 (squares and red solid line), Algorithm 2 (circles and yellow solid line), the knockoff-based method of Candès et al. (2018) (triangles and green dotted line), the Gaussian Mirror method of Xing et al. (2021) (diamonds and blue dashed line), and the Gaussian Mirror method with the FDP+ procedure (squares and purple dashed line).
  • Figure 3: Simulated FDR and power for the settings of $n = 400$ and $d = 400$ (left column), $n = 600$ and $d = 600$ (middle column), $n = 800$ and $d = 800$ (right column). The rows of the design matrix were generated from setting 2. The sparsity level is $k = 15$ and the FDR level is $\alpha = 0.1$. The methods compared are Algorithm 1 (squares and green dotted line), two-stage Algorithm 1 (squares and red solid line), Algorithm 2 (circles and blue dotted line), two-stage Algorithm 2 (circles and yellow solid line), the Gaussian Mirror method of Xing et al. (2021) (triangles and blue dashed line), and the Gaussian Mirror method with the FDP+ procedure (triangles and purple two-dashed line).
  • Figure 4: Simulated FDR and power for the settings of $n = 300$ and $d = 100$ (left column), $n = 500$ and $d = 200$ (middle column), $n = 700$ and $d = 300$ (right column). The rows of the design matrix were generated from setting 2. The sparsity level is $k = 0.1d$ and the FDR level is $\alpha = 0.1$. The methods compared are Algorithm 1 (squares and red solid line), Algorithm 2 (circles and green solid line), and the Bonferroni-Benjamini-Hochberg method of Sarkar and Tang (2022) (triangles and blue dotted line).
  • Figure 5: Results of the real data example for $\alpha = 0.2$. Blue represents the number of discoveries that are in the treatment-selected mutation panels list, and yellow represents the number of discoveries not in that list. The total number of HIV-1 protease positions in the treatment-selected mutation panels list is $34$. The methods compared are the proposed Algorithm 1 (Algorithm1), the proposed Algorithm 2 (Algorithm2), the Gaussian Mirror method of Xing et al. (2021) (GM), the knockoff-based method of Barber and Candès (2015) (Knockoff), and the Bonferroni-Benjamini-Hochberg method of Sarkar and Tang (2022) (B-BH).
  • ...and 16 more figures
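
The simulated FDR and power reported in these figures are, by the usual definitions, averages over replications of the false discovery proportion and of the fraction of true signals recovered. The sketch below shows those per-replication quantities under that standard convention; the helper name `fdp_and_power` and the example index sets are hypothetical and do not come from the paper's code.

```python
def fdp_and_power(selected, true_signals):
    """Return (false discovery proportion, power) for one replication.

    selected:     indices of rejected hypotheses
    true_signals: indices of variables with nonzero coefficients
    """
    selected = set(selected)
    true_signals = set(true_signals)
    false_discoveries = len(selected - true_signals)
    true_discoveries = len(selected & true_signals)
    # Convention: FDP is 0 when nothing is selected.
    fdp = false_discoveries / max(len(selected), 1)
    # Convention assumed here: power is reported as 0 when there are no signals.
    power = true_discoveries / max(len(true_signals), 1)
    return fdp, power


# Made-up replications: (selected set, true signal set).
replications = [([0, 2, 5], [0, 2, 3]), ([0, 3], [0, 2, 3]), ([], [0, 2, 3])]
results = [fdp_and_power(sel, truth) for sel, truth in replications]
simulated_fdr = sum(f for f, _ in results) / len(results)
simulated_power = sum(p for _, p in results) / len(results)
```

In the all-null setting of Figure 1 every selection is a false discovery, so the FDP of a replication is 1 whenever at least one hypothesis is rejected and 0 otherwise.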

Theorems & Definitions (5)

  • Remark 1
  • Remark 2
  • Theorem 1
  • Theorem 2
  • Theorem 3