A robust and powerful method for assessing replicability of high dimensional data

Haochen Lei; Yan Li; Hongyuan Cao

TL;DR

The authors propose a general empirical Bayes framework for multi-study replicability analysis that jointly models summary-level $p$-values while explicitly accounting for between-study heterogeneity. Applied to genome-wide association studies, it reveals replicable genetic associations that competing approaches fail to detect.

Abstract

Identifying signals that replicate across multiple studies is essential for establishing robust scientific evidence, yet existing methods for high-dimensional replicability analysis either rely on restrictive modeling assumptions, are limited to two-study settings, or lack statistical power. We propose a general empirical Bayes framework for multi-study replicability analysis that jointly models summary-level $p$-values while explicitly accounting for between-study heterogeneity. Within each study, non-null $p$-value densities are estimated nonparametrically under monotonicity constraints, enabling flexible and tuning-free inference. For two studies, we develop a local false discovery rate (Lfdr) statistic for the composite null of non-replicability and establish identifiability, consistency, and a cubic-rate convergence of the nonparametric MLE, along with minimax optimality. Extending replicability analysis to $n$ studies typically requires estimating $2^n$ latent configurations, which is computationally infeasible. To address this challenge, we introduce a scalable pairwise rejection strategy that decomposes the exponentially large composite null into disjoint components, yielding linear complexity in the number of studies. We prove asymptotic FDR control under mild regularity conditions and show that Lfdr-based thresholding is power-optimal. Extensive simulations demonstrate that our method provides substantial power gains while maintaining valid FDR control, outperforming state-of-the-art alternatives across a wide range of scenarios. Applying our framework to East Asian- and European-ancestry genome-wide association studies of type 2 diabetes reveals replicable genetic associations that competing approaches fail to detect, illustrating the method's practical utility in large-scale biomedical research.
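The two-study Lfdr statistic and its thresholding rule described in the abstract can be illustrated with a minimal sketch. It assumes a four-group mixture in which null $p$-values are Uniform(0, 1) and the two studies are conditionally independent given the latent state; the function names, the example weights, and the Beta(0.5, 1) non-null density are hypothetical choices for illustration, not the authors' implementation. In the paper the non-null densities are estimated nonparametrically, whereas here they are passed in as known functions.

```python
import numpy as np

def replicability_lfdr(p1, p2, xi, f1, f2):
    """Posterior probability that a feature is NOT replicable.

    Assumed model (not taken from the paper's code): the joint density is
    xi00 + xi10*f1(p1) + xi01*f2(p2) + xi11*f1(p1)*f2(p2),
    and the composite null covers the states 00, 10 and 01.
    """
    xi00, xi10, xi01, xi11 = xi
    a, b = f1(p1), f2(p2)
    null_part = xi00 + xi10 * a + xi01 * b
    return null_part / (null_part + xi11 * a * b)

def lfdr_step_up(lfdr, alpha):
    """Reject the largest set whose average Lfdr is at most alpha."""
    order = np.argsort(lfdr)
    running_mean = np.cumsum(lfdr[order]) / np.arange(1, lfdr.size + 1)
    passing = np.nonzero(running_mean <= alpha)[0]
    reject = np.zeros(lfdr.size, dtype=bool)
    if passing.size:
        # The running mean of sorted values is nondecreasing, so the last
        # passing index marks the full rejection set.
        reject[order[: passing[-1] + 1]] = True
    return reject

def beta_half_density(p):
    """Hypothetical decreasing non-null density: Beta(0.5, 1)."""
    return 0.5 * np.asarray(p, dtype=float) ** -0.5

if __name__ == "__main__":
    xi = (0.7, 0.1, 0.1, 0.1)  # illustrative mixture weights
    p1 = np.array([1e-4, 0.25, 1e-4, 0.60])
    p2 = np.array([1e-4, 0.25, 0.50, 1e-4])
    scores = replicability_lfdr(p1, p2, xi, beta_half_density, beta_half_density)
    print(scores)
    print(lfdr_step_up(scores, alpha=0.05))
```

Only the first feature, which has small $p$-values in both studies, receives a small Lfdr in this example; a feature that is significant in just one study stays close to the composite null, which is the behaviour a replicability statistic is meant to have.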

Paper Structure (53 sections, 25 theorems, 358 equations, 11 figures, 1 table, 2 algorithms)

This paper contains 53 sections, 25 theorems, 358 equations, 11 figures, 1 table, 2 algorithms.

Introduction
Two study case
Notation and problem setup
Parameter estimation and FDR control algorithm
Identifiability
Consistency
Optimal minimax rate
Theoretical guarantee of Algorithm 1
Oracle power
Multi-study extension
Oracle case
Parameter estimation
FDR control
Simulation studies
Independent case with two studies
...and 38 more sections

Key Result

Proposition 1

(Identifiability) Let $w=(\xi_{00},\xi_{10},\xi_{01},\xi_{11},f_1,f_2)$, and $w^*=(\xi_{00}^*,\xi_{10}^*,\xi_{01}^*,\xi_{11}^*, f_1^*,f_2^*)$. Assume If $p_w(x,y)=p_{w^*}(x,y)$ almost everywhere (a.e.) on $(0, 1)^2$, then and
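For orientation, the notation in Proposition 1 is consistent with a four-group mixture for the paired $p$-values $(x, y)$. The display below is a hedged reading of that notation, assuming Uniform(0, 1) null $p$-values and conditional independence across the two studies; it is not quoted from the paper, and it does not restate the proposition's conditions or conclusion.

$$
p_w(x,y) = \xi_{00} + \xi_{10} f_1(x) + \xi_{01} f_2(y) + \xi_{11} f_1(x) f_2(y), \qquad \xi_{00}+\xi_{10}+\xi_{01}+\xi_{11}=1 .
$$

Under this reading, the Lfdr for the composite null of non-replicability is the posterior probability of the states $00$, $10$, and $01$:

$$
\mathrm{Lfdr}(x,y) = \frac{\xi_{00} + \xi_{10} f_1(x) + \xi_{01} f_2(y)}{p_w(x,y)} .
$$

Identifiability then asks whether the weights $\xi_{jk}$ and the densities $f_1, f_2$ can be recovered uniquely from $p_w$.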

Figures (11)

Figure 1: FDR control of different methods for independent cases with two studies.
Figure 2: Power comparison of different methods for independent cases with two studies.
Figure 3: Left: Empirical versus nominal FDR for independent cases with two studies. Right: Power comparison at different nominal FDR levels.
Figure 4: FDR control of different methods for dependent cases with two studies.
Figure 5: Power comparison of different methods for dependent cases with two studies.
...and 6 more figures

Theorems & Definitions (38)

Proposition 1
Proposition 2
Remark 1
Theorem 1
Corollary 1
Theorem 2
Theorem 3
Remark 2
Proposition 3
Proposition 4
...and 28 more
