Choosing the nominal level post-hoc with knockoffs using e-values

Lasse Fischer, Konstantinos Sechidis

Abstract

The knockoff filter is a powerful tool for controlled variable selection with false discovery rate (FDR) control. In this paper, we leverage e-values to allow the nominal FDR level to be switched post-hoc, after looking at the data and applying the knockoff procedure. This approach addresses a significant limitation of standard knockoffs: while frequently used in high-dimensional regressions, they often lack power in low-dimensional and sparse signal settings. One of the main reasons for this is that the knockoff filter requires a minimum number of selections that depends strictly on the nominal FDR level. By utilizing e-values, we can increase the nominal level in cases where the original procedure makes no discoveries, or decrease it to improve precision when discoveries are abundant. These improvements come without any costs, meaning the results of our post-hoc procedure are always more informative than those of the original knockoff filter. We extend this methodology to recently proposed derandomized knockoff procedures and demonstrate its utility in variable selection problems relevant to drug development using real clinical trial data.
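The minimum-selection constraint mentioned in the abstract can be seen in a small sketch of the standard knockoff+ threshold rule. This is the usual data-dependent threshold of the original knockoff filter, not the post-hoc procedure of this paper, and the statistics `W` below are made-up illustrative values, not data from the paper. Since the estimated FDP is at least $1/k$ when $k$ variables are selected, any non-empty selection needs $k \geq 1/\alpha$.

```python
import math


def knockoff_plus_threshold(W, alpha):
    """Smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= alpha.
    Returns infinity when no candidate t satisfies the bound."""
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        negatives = sum(1 for w in W if w <= -t)
        positives = sum(1 for w in W if w >= t)
        if (1 + negatives) / max(1, positives) <= alpha:
            return t
    return math.inf


def knockoff_plus_select(W, alpha):
    """Indices of variables whose statistic reaches the threshold."""
    t = knockoff_plus_threshold(W, alpha)
    return [j for j, w in enumerate(W) if w >= t]


# Hypothetical knockoff statistics: three strong signals and five near-zero nulls.
W = [9.0, 7.5, 6.0, 0.4, -0.3, 0.2, -0.1, 0.05]

# At alpha = 0.20 at least 5 selections are needed, so nothing is selected.
selected_at_20 = knockoff_plus_select(W, 0.20)
# At alpha = 0.25 four selections suffice, and the filter returns four variables.
selected_at_25 = knockoff_plus_select(W, 0.25)
print(selected_at_20, selected_at_25)
```

In this toy input the three strong signals are clearly separated from the nulls, yet the filter at level 0.20 returns nothing because it can never output fewer than five variables. A slightly larger level admits a selection, which is the kind of situation the abstract describes as the motivation for changing the level after seeing the data.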

Paper Structure

This paper contains 34 sections, 5 theorems, 42 equations, 8 figures, 1 table, 3 algorithms.

Key Result

Theorem 2

Let $\boldsymbol{E}^{\textnormal{ph}}=(E_S^{\textnormal{ph}})_{S\subseteq [p]}$ be the family of local e-values defined in eq: local_e_ph and set Then $R^{\textnormal{ph}}\in \mathcal{R}_{\tilde{\alpha}^{\textnormal{ph}}}(\boldsymbol{E}^{\textnormal{ph}})$, where $R^{\textnormal{ph}}=\{i\in [p]:W_i\geq T_{\alpha^{\textnormal{kn}}}^{\textnormal{ph}}\}$. Therefore, it holds that Furthermore, $R^{\

Figures (8)

  • Figure 1: This figure illustrates the main contribution of our work, namely that the proposed post-hoc adjustment of the $\alpha$ values substantially increases statistical power compared to the original approach. For small values of $p_{\text{relevant}}$, the $\alpha$ level is slightly increased to obtain rejections in cases where the original method does not make any discoveries, whereas for larger $p_{\text{relevant}}$, the post-hoc method can even reduce $\alpha$ while maintaining similar (or better) power. The nominal level of $\alpha$ for the original KO method is set to $0.20$. Results are averaged over 2000 runs, and a full description of the data-generating process, parameter choices, and simulation protocol is provided in Section \ref{sec:sims}.
  • Figure 2: Performance comparison between the proposed derandomized post-hoc knockoff procedure and the original derandomized knockoffs method of candes2018panning across various numbers of actual relevant variables $p_{\text{relevant}}$. The signal amplitude is fixed at $A = 8$ and $A = 14$ for the Gaussian and Logistic models, respectively. The first row reports the realized power, the second row reports the nominal level $\alpha$ (for the original method this is the user-specified level $\alpha^{\textnormal{kn}}$, whereas for the post-hoc knockoff procedure it is the data-dependent level $\tilde{\alpha}^{\textnormal{ph}}$ returned by Algorithm \ref{alg:posthoc-alpha}) together with the estimated average FDP, and the last row reports the average ratio $\textnormal{FDP}/\alpha$. All values are averaged over 2000 runs.
  • Figure 3: Performance comparison between the proposed derandomized post-hoc knockoff procedure and the original derandomized knockoffs method of candes2018panning across various signal amplitude values $A$. The number of relevant variables is fixed at $p_{\text{relevant}} = 6$. The first row reports the realized power, the second row reports the nominal level $\alpha$ (for the original method this is the user-specified level $\alpha^{\textnormal{kn}}$, whereas for the post-hoc knockoff procedure it is the data-dependent level $\tilde{\alpha}^{\textnormal{ph}}$ returned by Algorithm \ref{alg:posthoc-alpha}) together with the estimated average FDP, and the last row reports the average ratio $\textnormal{FDP}/\alpha$. All values are averaged over 2000 runs.
  • Figure 4: Performance comparison between the proposed post-hoc knockoff procedure and the calibration knockoffs method of luo2025improving under Gaussian and Logistic models with $p=50$. Results are reported across signal amplitudes for varying sparsity levels, given by the number of relevant variables $p_{\text{relevant}} = 10$ and $15$. The first row reports the realized power, the second row reports the nominal level $\alpha$ (for the calibration method this is the user-specified level $\alpha^{\textnormal{kn}}$, whereas for the post-hoc knockoff procedure it is the data-dependent level $\tilde{\alpha}^{\textnormal{ph}}$ returned by Algorithm \ref{alg:posthoc-alpha}) together with the estimated average FDP, and the last row reports the average ratio $\textnormal{FDP}/\alpha$. All values are averaged over 2000 runs.
  • Figure 5: Performance comparison between the proposed derandomized post-hoc knockoff procedure for controlling FDR and the original derandomized knockoffs method of ren2024derandomised under Gaussian ($p=800$) and Logistic ($p=600$) models. Results are reported across signal amplitudes for varying sparsity levels, given by the number of relevant variables $p_{\text{relevant}}$, including the original setting: $p_{\text{relevant}}=80$ for the Gaussian model and $p_{\text{relevant}}=60$ for the Logistic model. The first row reports the realized power, the second row reports the nominal level $\alpha$ (for the original method this is the user-specified level $\alpha^{\textnormal{ebh}}$, whereas for the post-hoc knockoff procedure it is the data-dependent level $\tilde{\alpha}^{\textnormal{dph}}$ returned by Algorithm \ref{alg:posthoc-alpha_derand}) together with the estimated average FDP, and the last row reports the average ratio $\textnormal{FDP}/\alpha$. All values are averaged over 200 runs.
  • ...and 3 more figures
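
The quantities reported in the figure rows (realized power, FDP, and the ratio $\textnormal{FDP}/\alpha$) could be computed per simulation run roughly as follows. This is a minimal sketch assuming the standard definitions, with FDP as the fraction of selected variables that are irrelevant and power as the fraction of relevant variables that are selected. The function name and the example index sets are hypothetical and not taken from the paper.

```python
def fdp_and_power(selected, relevant):
    """False discovery proportion and power of one selection set.

    selected: indices returned by a selection procedure
    relevant: indices of the truly relevant variables (known in simulations)
    """
    selected, relevant = set(selected), set(relevant)
    false_discoveries = len(selected - relevant)
    # Convention: FDP is 0 when nothing is selected.
    fdp = false_discoveries / max(1, len(selected))
    power = len(selected & relevant) / max(1, len(relevant))
    return fdp, power


# Hypothetical run: four selections, one of which is not truly relevant.
alpha = 0.25
fdp, power = fdp_and_power(selected=[0, 1, 2, 3], relevant=[0, 1, 2, 4])
ratio = fdp / alpha
print(fdp, power, ratio)
```

Averaging `fdp`, `power`, and `ratio` over repeated runs would give summaries of the kind described in the captions. For the post-hoc procedure, `alpha` would be the data-dependent level of each run rather than a fixed constant.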

Theorems & Definitions (11)

  • Example 1
  • Remark 1
  • Theorem 2
  • Remark 3
  • Theorem 4
  • Remark 5
  • Proposition 6
  • Example 2
  • Theorem 7
  • Remark 8
  • ...and 1 more