How to Tell When a Result Will Replicate: Significance and Replication in Distributional Null Hypothesis Tests

Fintan Costello, Paul Watts

TL;DR

The paper tackles the replication crisis by introducing a distributional null hypothesis that incorporates both within- and between-experiment variance. It develops a mathematically coherent framework for significance and replication ($p_{sig}$ and $p_{rep}$) using a $t$-distribution with a variance ratio $b$, and provides practical closed-form estimators and bounds for everyday use. Empirical validation on the Many Labs 1 dataset shows that many results deemed significant under point-form tests are compatible with random cross-experiment variation, and that predicted replication probabilities align closely with observed replication rates. The approach offers a principled, conservative alternative to traditional significance testing and complements Bayesian methods (e.g., the JZS $t$ test) by focusing on replication risk and cross-experiment heterogeneity, with broad applicability to $t$-test, regression, correlation, and contingency analyses.

Abstract

There is a well-known problem in Null Hypothesis Significance Testing: many statistically significant results fail to replicate in subsequent experiments. We show that this problem arises because standard 'point-form null' significance tests consider only within-experiment but ignore between-experiment variation, and so systematically underestimate the degree of random variation in results. We give an extension to standard significance testing that addresses this problem by analysing both within- and between-experiment variation. This 'distributional null' approach does not underestimate experimental variability and so is not overconfident in identifying significance; because this approach addresses between-experiment variation, it gives mathematically coherent estimates for the probability of replication of significant results. Using a large-scale replication dataset (the first 'Many Labs' project), we show that many experimental results that appear statistically significant in standard tests are in fact consistent with random variation when both within- and between-experiment variation are taken into account in this approach. Further, grouping experiments in this dataset into 'predictor-target' pairs, we show that the predicted replication probabilities for target experiments produced in this approach (given predictor experiment results and the sample sizes of the two experiments) are strongly correlated with observed replication rates. Distributional null hypothesis testing thus gives researchers a statistical tool for identifying statistically significant and reliably replicable results.

Paper Structure

This paper contains 24 sections, 100 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: (Left) Histogram of the standardised sample means, $z=(\overline{X}_i -\overline{X}_*)/S_0$, for US sites for each task in the Many Labs 1 replication dataset, in bins of size $0.5$ with colour representing task. The line shows the standard Normal distribution $\mathcal{N}\left(0,1\right)$, scaled by bin size $\times$ number of observations to match the histogram scale. (Right) QQ plot showing quantiles of the standardised sample means, $z=(\overline{X}_i -\overline{X}_*)/S_0$, for US sites for each task against quantiles of the standard Normal distribution.
  • Figure 2: (Left) Scatterplot of $\log_{10}$ distributional $p_{sig}$ against $\log_{10}$ point-form $p$ for all experiments in all tasks in the Many Labs 1 replication dataset for US sites (note the difference in scales). Lines indicate statistical significance at levels $\alpha=0.05$ (dotted), $\alpha=0.025$ (dashed) and $\alpha=0.01$ (solid). (Right) Analogous scatterplot of $\log_{10}$ distributional $p_{sig}$ against $\log_{10}$ Bayes Factor $BF_{10}$. Horizontal lines indicate Bayes Factor levels of $3$ (dotted: $BF_{10} \geq 3$ being typically interpreted as moderate evidence against the null), $10$ (dashed: strong evidence against the null) and $30$ (solid: extremely strong evidence against the null). Points for $9$ experiments, all with $p < 10^{-40}$ ($BF_{10} > 10^{40}$) and ranging as low as $p < 10^{-325}$ ($BF_{10} > 10^{325}$), are not shown. Note that $\log_{10} p$ and $\log_{10} BF$ values are identical modulo a difference in scale (an $r=0.9999$ correlation between $\log_{10} p$ and $\log_{10} BF$ across experiments).
  • Figure 3: Predicted vs observed replication rate (for $\alpha =0.05,0.025,0.01,0.005,0.001$) for target experiments in the Many Labs 1 replication dataset for US sites, grouped by $p_{rep}$ in bins of size $1/40$. Bubble size represents the number of predictor-target pairs in each bin (sizes range from $40$ pairs for the smallest to $2800$ for the largest bubble). Hollow bubbles contain only non-significant and solid bubbles only significant predictor experiments. The left graph shows results for the double-integral $p_{rep}(t|\hat{b})$ replication estimate, the right for the closed-form $p_{rep}$ estimate. Correlation between predicted and observed replication rates for bubbles was high for $p_{rep}(t|\hat{b})$ ($r=0.91,\hat{p}_{sig} < 10^{-11}$) and for the approximation $p_{rep}$ ($r=0.81,\hat{p}_{sig} < 10^{-7}$, $\hat{p}_{rep}=1.0$ with $\alpha=0.05$ and $B=1$ for both expressions).
  • Figure 4: Sensitivity analysis. Each graph shows the predicted vs observed replication rate (for $\alpha = 0.05,0.025,0.01,0.005,0.001$) for target experiments in the Many Labs 1 replication dataset for US sites as in Figure 3, but with between-experiment variation $S_0^2$ multiplied by a scale factor $e$.
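
The standardisation behind Figure 1, $z=(\overline{X}_i -\overline{X}_*)/S_0$, and the scaling of the Normal curve by bin size $\times$ number of observations can be sketched as follows. This is only an illustrative sketch, not the authors' code: the captions do not define $\overline{X}_*$ or $S_0$, so the code assumes $\overline{X}_*$ is the mean of the site means for a task and $S_0$ is their sample standard deviation. The function names and the site means are hypothetical.

```python
import math
import numpy as np

def standardise_site_means(site_means):
    """Return z_i = (mean_i - grand_mean) / s0 for the sites of one task.

    ASSUMPTION: grand_mean (standing in for X-bar_*) is the mean of the site
    means, and s0 (standing in for S_0) is their sample standard deviation.
    """
    site_means = np.asarray(site_means, dtype=float)
    grand_mean = site_means.mean()
    s0 = site_means.std(ddof=1)
    return (site_means - grand_mean) / s0

def scaled_normal_curve(x, n_obs, bin_width=0.5):
    """Standard Normal density scaled by bin width times number of observations,
    so that it is on the same scale as histogram counts."""
    x = np.asarray(x, dtype=float)
    pdf = np.exp(-0.5 * x**2) / math.sqrt(2.0 * math.pi)
    return pdf * bin_width * n_obs

# Hypothetical site means for a single task (illustrative numbers only).
site_means = [4.1, 3.8, 4.4, 4.0, 3.7, 4.3]
z = standardise_site_means(site_means)

# Histogram in bins of size 0.5, as in the caption of Figure 1.
edges = np.arange(-4.0, 4.0 + 0.5, 0.5)
counts, _ = np.histogram(z, bins=edges)

# Expected count per bin under N(0, 1), evaluated at the bin centres.
centres = (edges[:-1] + edges[1:]) / 2.0
expected = scaled_normal_curve(centres, n_obs=len(z))
```

Comparing `counts` with `expected` bin by bin gives the same visual check as the left panel of Figure 1. With real data, the standardisation would be applied separately for each task before the $z$ values are pooled.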