A robust and powerful method for assessing replicability of high dimensional data

Haochen Lei, Yan Li, Hongyuan Cao

TL;DR

The paper proposes a general empirical Bayes framework for multi-study replicability analysis that jointly models summary-level $p$-values while explicitly accounting for between-study heterogeneity. Applied to genome-wide association studies of type 2 diabetes, it reveals replicable genetic associations that competing approaches fail to detect.

Abstract

Identifying signals that replicate across multiple studies is essential for establishing robust scientific evidence, yet existing methods for high-dimensional replicability analysis either rely on restrictive modeling assumptions, are limited to two-study settings, or lack statistical power. We propose a general empirical Bayes framework for multi-study replicability analysis that jointly models summary-level $p$-values while explicitly accounting for between-study heterogeneity. Within each study, non-null $p$-value densities are estimated nonparametrically under monotonicity constraints, enabling flexible and tuning-free inference. For two studies, we develop a local false discovery rate (Lfdr) statistic for the composite null of non-replicability and establish identifiability, consistency, and a cubic-rate convergence of the nonparametric MLE, along with minimax optimality. Extending replicability analysis to $n$ studies typically requires estimating $2^n$ latent configurations, which is computationally infeasible. To address this challenge, we introduce a scalable pairwise rejection strategy that decomposes the exponentially large composite null into disjoint components, yielding linear complexity in the number of studies. We prove asymptotic FDR control under mild regularity conditions and show that Lfdr-based thresholding is power-optimal. Extensive simulations demonstrate that our method provides substantial power gains while maintaining valid FDR control, outperforming state-of-the-art alternatives across a wide range of scenarios. Applying our framework to East Asian- and European-ancestry genome-wide association studies of type 2 diabetes reveals replicable genetic associations that competing approaches fail to detect, illustrating the method's practical utility in large-scale biomedical research.
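
The two-study Lfdr statistic described in the abstract can be illustrated with a minimal sketch. It assumes a four-component mixture with weights $(\xi_{00},\xi_{10},\xi_{01},\xi_{11})$, Uniform(0, 1) null $p$-values, and independence between studies given the latent state; this form is inferred from the notation of Proposition 1 rather than quoted from the paper. The Beta densities are hypothetical stand-ins for the non-null densities that the paper estimates nonparametrically, and the step-up rule is the standard Lfdr thresholding rule, not necessarily the authors' exact procedure. All function names are illustrative.

```python
import numpy as np

def lfdr_two_study(x, y, xi, f1, f2):
    """Local FDR for the composite null "not non-null in both studies".

    xi = (xi00, xi10, xi01, xi11): assumed mixture weights summing to 1.
    f1, f2: assumed non-null p-value densities (null density is 1 on (0, 1)).
    """
    xi00, xi10, xi01, xi11 = xi
    a, b = f1(x), f2(y)
    # Mass of the three non-replicable configurations at (x, y).
    null_part = xi00 + xi10 * a + xi01 * b
    return null_part / (null_part + xi11 * a * b)

def step_up_reject(lfdr, alpha):
    """Reject the largest set whose average Lfdr stays at or below alpha."""
    order = np.argsort(lfdr)
    running_mean = np.cumsum(lfdr[order]) / np.arange(1, lfdr.size + 1)
    passing = np.nonzero(running_mean <= alpha)[0]
    reject = np.zeros(lfdr.size, dtype=bool)
    if passing.size:
        reject[order[: passing[-1] + 1]] = True
    return reject

def beta_density(a):
    """Beta(a, 1) density a * p**(a - 1); decreasing on (0, 1) for 0 < a < 1."""
    return lambda p: a * np.power(p, a - 1.0)

# Hypothetical parameters, chosen only for illustration.
xi = (0.7, 0.1, 0.1, 0.1)
f1, f2 = beta_density(0.5), beta_density(0.5)
x = np.array([0.0001, 0.25, 0.0004, 0.8])
y = np.array([0.0002, 0.25, 0.0001, 0.6])
scores = lfdr_two_study(x, y, xi, f1, f2)
rejected = step_up_reject(scores, alpha=0.05)
```

A pair of small $p$-values yields a small Lfdr only when both studies show evidence, since the denominator grows through the product $f_1(x) f_2(y)$; a small $p$-value in one study alone inflates the numerator as well.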

Paper Structure (53 sections, 25 theorems, 358 equations, 11 figures, 1 table, 2 algorithms)

Key Result

Proposition 1

(Identifiability) Let $w=(\xi_{00},\xi_{10},\xi_{01},\xi_{11},f_1,f_2)$ and $w^*=(\xi_{00}^*,\xi_{10}^*,\xi_{01}^*,\xi_{11}^*,f_1^*,f_2^*)$. Assume [...]. If $p_w(x,y)=p_{w^*}(x,y)$ almost everywhere (a.e.) on $(0,1)^2$, then [...] and [...].
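
For orientation, one reading of $p_w$ that is consistent with the notation above is a four-component mixture over the latent states of the two studies. The display below is an assumption, not a formula quoted from the paper: it takes null $p$-values to be Uniform(0, 1) and the two studies to be independent given the latent state.

$$
p_w(x,y) \;=\; \xi_{00} \;+\; \xi_{10}\, f_1(x) \;+\; \xi_{01}\, f_2(y) \;+\; \xi_{11}\, f_1(x)\, f_2(y), \qquad (x,y)\in(0,1)^2 .
$$

Under this reading, identifiability means that the joint density of the paired $p$-values determines both the mixture weights and the non-null densities $f_1$ and $f_2$.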

Figures (11)

  • Figure 1: FDR control of different methods for independent cases with two studies.
  • Figure 2: Power comparison of different methods for independent cases with two studies.
  • Figure 3: Left: Empirical versus nominal FDR for independent cases with two studies. Right: Power comparison at different nominal FDR levels.
  • Figure 4: FDR control of different methods for dependent cases with two studies.
  • Figure 5: Power comparison of different methods for dependent cases with two studies.
  • ...and 6 more figures

Theorems & Definitions (38)

  • Proposition 1
  • Proposition 2
  • Remark 1
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Theorem 3
  • Remark 2
  • Proposition 3
  • Proposition 4
  • ...and 28 more