
DART2: a robust multiple testing method to smartly leverage helpful or misleading ancillary information

Xuechan Li, Jichun Xie

TL;DR

DART2 addresses the problem of leveraging ancillary information robustly in large-scale multiple testing. It uses a two-stage procedure: Stage I screens hypotheses on an aggregation tree, and Stage II refines the results within the screened-out nodes. The method does not rely on correct priors about the ancillary information. Asymptotic FDR control is guaranteed, and power is at least as high as that of the Benjamini-Hochberg (BH) procedure when the ancillary information is uninformative, with potential gains when it is informative. Node-level statistics $T_S$ are computed by Stouffer aggregation across tree nodes, and a robust refining threshold guards against misleading ancillary information. In simulations and a breast-cancer gene-expression application, DART2 shows robust FDR control, improved power over competing methods, and substantial computational efficiency, which highlights its practical value for genomics and other domains with heterogeneous ancillary information.
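
The Stouffer aggregation mentioned above can be sketched as follows. This is a minimal illustration of the generic Stouffer combination, assuming independent one-sided p-values strictly between 0 and 1; the exact definition of $T_S$ in the paper may differ, and the function names `stouffer_node_statistic` and `node_p_value` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def stouffer_node_statistic(p_values):
    """Combine the one-sided p-values of the features in one node.

    Each p-value (assumed to lie strictly in (0, 1)) is mapped to a z-score;
    the z-scores are summed and scaled by sqrt(node size), so the result is
    N(0, 1) when all features are independent nulls.
    """
    z_scores = [_STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    return sum(z_scores) / sqrt(len(z_scores))


def node_p_value(t_s):
    """One-sided p-value of a node-level statistic under N(0, 1)."""
    return 1.0 - _STD_NORMAL.cdf(t_s)


# Hypothetical node holding two features with moderately small p-values.
node = [0.01, 0.03]
t_s = stouffer_node_statistic(node)
p_node = node_p_value(t_s)
```

In this toy node the aggregated p-value is smaller than either individual p-value, which is the intuition behind screening on nodes: weak signals that sit close together on the tree reinforce each other.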

Abstract

In many applications of multiple testing, ancillary information is available that reflects the null or alternative status of the hypotheses. Several methods have been developed to leverage this ancillary information to enhance testing power, typically requiring that the ancillary information be helpful enough to ensure favorable performance. In this paper, we develop a robust and effective distance-assisted multiple testing procedure named DART2, designed to be powerful and robust regardless of the quality of the ancillary information. When the ancillary information is helpful, DART2 can asymptotically control FDR while improving power; otherwise, DART2 can still control FDR and maintain power at least as high as that obtained by ignoring the ancillary information. We demonstrate DART2's superior performance compared to existing methods through numerical studies under various settings. In addition, DART2 has been applied to a gene association study, where we show its superior accuracy and robustness under two different types of ancillary information.
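
For reference, the baseline of "ignoring the ancillary information" is commonly taken to be the Benjamini-Hochberg (BH) step-up procedure, which the TL;DR names as the power benchmark. The sketch below is a standard textbook BH implementation, not code from the paper; the function name `benjamini_hochberg` and the example p-values are assumptions made for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted 0-based indices rejected by the BH step-up procedure."""
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: keep the largest rank k with p_(k) <= alpha * k / m.
        if p_values[idx] <= alpha * rank / m:
            largest_passing_rank = rank
    return sorted(order[:largest_passing_rank])


# Hypothetical p-values: the two smallest fail their own thresholds,
# but the third passes, so the step-up rule rejects all three.
example_p = [0.02, 0.03, 0.035, 0.9]
rejected = benjamini_hochberg(example_p, alpha=0.05)
```

BH uses only the p-values, so it is unaffected by misleading ancillary information; that is why it serves as a natural floor for the power of a robust procedure.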

Paper Structure

This paper contains 14 sections, 5 theorems, 49 equations, 6 figures, 2 tables, and 1 algorithm.

Key Result

Theorem 1

Assume that the number of alternative hypotheses satisfies $m_1=O(m^{r_1})$ for some $r_1<(M^{L-1}+1)^{-1}$. Then DART2 with the naive refining threshold $\hat{t}^*_S$ controls the FDR at any pre-specified level $\alpha\in (0,1)$, i.e., $\lim_{m,n\to\infty}\text{FDR} \leq \alpha$.
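
To make the sparsity condition concrete, consider the tree of Figure 1, which has $L=3$ layers and at most $M=2$ children per node. The substitution below is only an illustration of the bound on $r_1$, not a statement from the paper:

$$
r_1 < \left(M^{L-1}+1\right)^{-1} = \left(2^{3-1}+1\right)^{-1} = \frac{1}{5}.
$$

For such a tree, the condition therefore requires $m_1 = O(m^{r_1})$ with $r_1 < 0.2$. Because $M^{L-1}$ grows with both $M$ and $L$, deeper or wider trees make the bound on $r_1$ smaller, so the theorem then covers only sparser alternatives.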

Figures (6)

  • Figure 1: An illustrative example of the DART2 procedure with $7$ features. (a) An aggregation tree obtained from prior knowledge, comprising $L=3$ layers with a maximum of two children allowed for each node ($M=2$); (b) Screening process embedded in the aggregation tree, where each parent node is associated with a node-level hypothesis. Nodes on the aggregation tree are depicted as bins, with higher bins corresponding to nodes with larger test statistics $T_{S}$; (c) Refining process for further selecting the features located within the nodes (bins) that were screened out in Stage I. The rejected hypotheses on each layer are shown as $R^{(\ell)}$ in the figure, and the final rejection set is $\mathcal{R}=\{1,3,5,6\}$.
  • Figure 2: Performance of DART2 across varying numbers of layers, with desired feature-level FDR $\alpha \in \{1\%, 5\%\}$ and misleading level $\tau \in \{0,0.2,0.4,0.6,0.8,1\}$. The bars indicate DART2's average performance (FDP and sensitivity) in testing $m=1000$ hypotheses, while the error bars show the $90\%$ confidence intervals (the $5\%$ and $95\%$ quantiles) over the $200$ repetitions. The left panel shows the average FDP, with dashed horizontal lines indicating the desired FDR level $\alpha$. The right panel shows the average sensitivity.
  • Figure 3: Performance comparison of the 7-layer DART2 with the competing method under different types of test statistics, different desired FDR levels $\alpha$, and different misleading levels $\tau$. The primary bars indicate the average performance over $200$ repetitions. The left panel shows the average feature-level FDP, with dashed horizontal lines indicating the desired FDR $\alpha$. The right panel shows the average feature-level sensitivity.
  • Figure 4: F1 score contour plots to compare the performance of DART2, AdaPT, DART and FDRL. The dashed line represents the F1 score contour.
  • Figure A1: Illustration of the simulated hypotheses' affiliated locations and their corresponding $\eta_i$. Each hypothesis is represented by a dot. The dot color indicates the hypothesis status, and the dot size is $\text{L-eta}=\log(\eta_i+1)+0.01$, which is proportional to the signal strength.
  • ...and 1 more figure
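
The figure captions report FDP, sensitivity, and F1 score. The sketch below shows how these quantities are usually computed for a single rejection set, using the standard definitions (which the paper may state slightly differently). The rejection set is the one given in the Figure 1 caption; the set of true alternatives is purely hypothetical, as is the function name `evaluate_rejections`.

```python
def evaluate_rejections(rejected, true_alternatives):
    """Return (FDP, sensitivity, F1) for one rejection set.

    FDP is the share of rejections that are true nulls, sensitivity is the
    share of true alternatives that are rejected, and F1 is the harmonic
    mean of precision (1 - FDP) and sensitivity.
    """
    rejected = set(rejected)
    true_alternatives = set(true_alternatives)
    true_pos = len(rejected & true_alternatives)
    false_pos = len(rejected) - true_pos
    fdp = false_pos / max(len(rejected), 1)  # defined as 0 with no rejections
    sensitivity = true_pos / max(len(true_alternatives), 1)
    precision = 1.0 - fdp if rejected else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom > 0 else 0.0
    return fdp, sensitivity, f1


final_rejections = {1, 3, 5, 6}    # rejection set from the Figure 1 caption
hypothetical_truth = {1, 3, 5, 7}  # assumed true alternatives, for illustration
fdp, sens, f1 = evaluate_rejections(final_rejections, hypothetical_truth)
```

Averaging the FDP over repeated simulations gives an empirical estimate of the FDR, which is what the dashed lines at level $\alpha$ in Figures 2 and 3 are compared against.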

Theorems & Definitions (10)

  • Theorem 1
  • Definition 1: Robust refining threshold
  • Theorem 2
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Proof of Theorem \ref{thm:indfdr}
  • Proof of Theorem \ref{thm:robfdr}
  • Proof of Lemma \ref{lem::tail}
  • Proof of Lemma \ref{lem::nullsig}