Doublethink: simultaneous Bayesian-frequentist model-averaged hypothesis testing
Helen R. Fryer, Nicolas Arning, Daniel J. Wilson
TL;DR
The paper addresses the tension between Bayesian model averaging and classical hypothesis testing by showing that Bayesian model-averaged tests can be formulated as a closed testing procedure that strongly controls the frequentist familywise error rate (FWER). It derives a chi-squared tail approximation for the model-averaged posterior odds, which yields asymptotic p-values and allows interconversion with frequentist test statistics via the model-averaged deviance. Using theoretical results, extensive simulations, and a Mendelian randomization study of age-related macular degeneration (AMD), it demonstrates how testing groups of correlated variables improves error control and power, and it highlights finite-sample inflation along with mitigation strategies based on grouping. The work provides a practical framework that bridges the Bayesian false discovery rate (FDR) and the frequentist FWER, supporting post-hoc variable grouping, multilevel testing, and interpretability in high-dimensional settings, with wider implications for hypothesis testing practice.
Abstract
Establishing the frequentist properties of Bayesian approaches widens their appeal and offers new understanding. In hypothesis testing, Bayesian model averaging addresses the problem that conclusions are sensitive to variable selection. But Bayesian false discovery rate (FDR) guarantees are sensitive to subjective prior assumptions. Here we show that Bayesian model-averaged hypothesis testing is a closed testing procedure that controls the frequentist familywise error rate (FWER) in the strong sense. To quantify the FWER, we use the theory of regular variation and likelihood asymptotics to derive a chi-squared tail approximation for the model-averaged posterior odds. Convergence is pointwise as the sample size grows and, in a simplified setting subject to a minimum effect size assumption, uniform. The 'Doublethink' method computes simultaneous posterior odds and asymptotic p-values for model-averaged hypothesis testing. We explore Doublethink through a Mendelian randomization study and simulations, comparing to approaches like LASSO, stepwise regression, the Benjamini-Hochberg procedure, the harmonic mean p-value and e-values. We consider the limitations of the approach, including finite-sample inflation, and mitigations, like testing groups of correlated variables. We discuss the benefits of Doublethink, including post-hoc variable selection, and its wider implications for the theory and practice of hypothesis testing.
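
For readers unfamiliar with closed testing, the following is a minimal generic sketch of the principle, not the Doublethink method itself. It assumes one p-value per elementary hypothesis and uses a Bonferroni local test for each intersection hypothesis, which is a stand-in chosen for illustration (this choice reproduces Holm's procedure); the paper instead bases its tests on model-averaged posterior odds. The function names `bonferroni_local_test` and `closed_testing` are hypothetical.

```python
from itertools import combinations


def bonferroni_local_test(pvals, subset, alpha):
    """Illustrative local test (an assumption, not the paper's test):
    reject the intersection hypothesis over `subset` when its smallest
    p-value is at most alpha divided by the subset size."""
    return min(pvals[j] for j in subset) <= alpha / len(subset)


def closed_testing(pvals, alpha=0.05, local_test=bonferroni_local_test):
    """Closed testing: reject elementary hypothesis i only if every
    intersection hypothesis containing i is rejected by its local
    level-alpha test. This controls the FWER in the strong sense.

    Enumerates all 2^m - 1 non-empty subsets, so it is only practical
    for a small number of hypotheses m.
    """
    m = len(pvals)
    # Evaluate the local test once for each non-empty subset of hypotheses.
    local_rejections = {
        subset: local_test(pvals, subset, alpha)
        for k in range(1, m + 1)
        for subset in combinations(range(m), k)
    }
    # Hypothesis i is rejected when all subsets that contain it are rejected.
    return [
        all(rej for subset, rej in local_rejections.items() if i in subset)
        for i in range(m)
    ]


if __name__ == "__main__":
    # Made-up p-values purely for demonstration.
    print(closed_testing([0.001, 0.02, 0.04, 0.30]))
```

With these made-up p-values only the first hypothesis is rejected: the second fails because the intersection of the last three hypotheses has minimum p-value 0.02, which exceeds 0.05/3. Swapping in a different `local_test` changes the procedure while keeping the same closure logic.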
