How to Tell When a Result Will Replicate: Significance and Replication in Distributional Null Hypothesis Tests
Fintan Costello, Paul Watts
TL;DR
The paper tackles the replication crisis by introducing a distributional null hypothesis that incorporates both within- and between-experiment variance. It develops a mathematically coherent framework for significance and replication ($p_{sig}$ and $p_{rep}$) using a $t$-distribution with a variance ratio $b$, and provides practical closed-form estimators and bounds for everyday use. Empirical validation on the Many Labs 1 dataset shows that many results deemed significant under point-form tests are compatible with random cross-experiment variation, and that predicted replication probabilities align closely with observed replication rates. The approach offers a principled, conservative alternative to traditional significance testing and complements Bayesian methods (e.g., JZS $t$ test) by focusing on replication risk and cross-experiment heterogeneity, with broad applicability to $t$, regression, correlation, and contingency analyses.
Abstract
There is a well-known problem in Null Hypothesis Significance Testing: many statistically significant results fail to replicate in subsequent experiments. We show that this problem arises because standard 'point-form null' significance tests consider only within-experiment variation and ignore between-experiment variation, and so systematically underestimate the degree of random variation in results. We give an extension to standard significance testing that addresses this problem by analysing both within- and between-experiment variation. This 'distributional null' approach does not underestimate experimental variability and so is not overconfident in identifying significance; because it addresses between-experiment variation, it also gives mathematically coherent estimates for the probability that a significant result will replicate. Using a large-scale replication dataset (the first 'Many Labs' project), we show that many experimental results that appear statistically significant in standard tests are in fact consistent with random variation when both within- and between-experiment variation are taken into account. Further, grouping experiments in this dataset into 'predictor-target' pairs, we show that the predicted replication probabilities for target experiments (given the predictor experiment's results and the sample sizes of the two experiments) are strongly correlated with observed replication rates. Distributional null hypothesis testing thus gives researchers a statistical tool for identifying results that are both statistically significant and reliably replicable.
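The central claim, that a point-form null test becomes overconfident once between-experiment variation exists, can be illustrated with a small simulation. The sketch below is not the authors' method and does not compute their $p_{sig}$ or $p_{rep}$; it only shows how often an ordinary one-sample $t$ test rejects when each experiment's true mean is itself drawn from a zero-centred distribution. The sample size, the within-experiment standard deviation, the between-experiment standard deviation `tau`, and the function name are all illustrative assumptions, and the critical value is hard-coded for 49 degrees of freedom.

```python
import math
import random
import statistics

# Illustrative settings (assumptions, not values from the paper).
N = 50                 # observations per simulated experiment
SIGMA = 1.0            # within-experiment standard deviation
T_CRIT_DF49 = 2.0096   # two-sided 5% critical value of Student's t, df = N - 1 = 49


def rejection_rate(tau, n_experiments=4000, seed=0):
    """Fraction of simulated experiments in which a point-null t test rejects.

    Each experiment's true mean is drawn from Normal(0, tau), so tau is the
    between-experiment standard deviation. With tau = 0 the usual point-form
    null (true mean exactly zero in every experiment) holds.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        true_mean = rng.gauss(0.0, tau)  # between-experiment variation
        sample = [rng.gauss(true_mean, SIGMA) for _ in range(N)]  # within-experiment variation
        standard_error = statistics.stdev(sample) / math.sqrt(N)
        t_statistic = statistics.fmean(sample) / standard_error
        if abs(t_statistic) > T_CRIT_DF49:
            rejections += 1
    return rejections / n_experiments


rate_point_null = rejection_rate(tau=0.0)      # no between-experiment variation
rate_with_between = rejection_rate(tau=0.2)    # modest between-experiment variation

print(f"rejection rate, tau = 0.0: {rate_point_null:.3f}")
print(f"rejection rate, tau = 0.2: {rate_with_between:.3f}")
```

With `tau = 0` the rejection rate should sit near the nominal 5% level. With `tau > 0` the same test rejects considerably more often even though effects average to zero across experiments, which is the kind of apparent significance the abstract attributes to random between-experiment variation.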
