Doublethink: simultaneous Bayesian-frequentist model-averaged hypothesis testing

Helen R. Fryer; Nicolas Arning; Daniel J. Wilson

TL;DR

The paper addresses the tension between Bayesian model averaging and classical hypothesis testing by showing that Bayesian model-averaged tests can be formulated as a closed testing procedure that strongly controls the frequentist familywise error rate (FWER). It derives a chi-squared tail approximation for the model-averaged posterior odds, which yields asymptotic p-values and allows interconversion with frequentist test statistics via the model-averaged deviance. Through theoretical results, extensive simulations and a Mendelian randomization study of age-related macular degeneration (AMD), it demonstrates how testing groups of correlated variables improves error control and power, while also highlighting finite-sample inflation and grouping-based mitigation strategies. The work provides a practical framework that bridges the Bayesian false discovery rate (FDR) and the frequentist FWER, supporting post-hoc variable grouping, multilevel testing and interpretability in high-dimensional settings.
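As a rough illustration of what interconverting a deviance and an asymptotic p-value means, the sketch below uses the textbook chi-squared tail with one degree of freedom. It is not the paper's model-averaged formula, which involves the prior and the model-averaged posterior odds; the function names `deviance_to_p` and `p_to_deviance` are hypothetical and chosen only for this example.

```python
import math
from statistics import NormalDist


def deviance_to_p(deviance):
    """Upper tail probability P(X > deviance) for X ~ chi-squared with 1 df.

    Uses the identity P(chi2_1 > x) = erfc(sqrt(x / 2)).
    Assumption: a single tested parameter (one degree of freedom).
    """
    if deviance < 0:
        raise ValueError("deviance must be non-negative")
    return math.erfc(math.sqrt(deviance / 2.0))


def p_to_deviance(p):
    """Inverse map: the 1-df chi-squared quantile with upper tail area p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    # A chi-squared(1) variable is the square of a standard normal variable.
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    return z * z


if __name__ == "__main__":
    d = 6.0
    p = deviance_to_p(d)
    print(f"deviance {d} -> p-value {p:.5f}")
    print(f"p-value {p:.5f} -> deviance {p_to_deviance(p):.5f}")
```

The two functions are inverses of each other, so a threshold stated on one scale can be restated on the other under the one-degree-of-freedom assumption.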

Abstract

Establishing the frequentist properties of Bayesian approaches widens their appeal and offers new understanding. In hypothesis testing, Bayesian model averaging addresses the problem that conclusions are sensitive to variable selection. But Bayesian false discovery rate (FDR) guarantees are sensitive to subjective prior assumptions. Here we show that Bayesian model-averaged hypothesis testing is a closed testing procedure that controls the frequentist familywise error rate (FWER) in the strong sense. To quantify the FWER, we use the theory of regular variation and likelihood asymptotics to derive a chi-squared tail approximation for the model-averaged posterior odds. Convergence is pointwise as the sample size grows and, in a simplified setting subject to a minimum effect size assumption, uniform. The 'Doublethink' method computes simultaneous posterior odds and asymptotic p-values for model-averaged hypothesis testing. We explore Doublethink through a Mendelian randomization study and simulations, comparing to approaches like LASSO, stepwise regression, the Benjamini-Hochberg procedure, the harmonic mean p-value and e-values. We consider the limitations of the approach, including finite-sample inflation, and mitigations, like testing groups of correlated variables. We discuss the benefits of Doublethink, including post-hoc variable selection, and its wider implications for the theory and practice of hypothesis testing.
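To make the idea of a closed testing procedure concrete, here is a minimal generic sketch, not the Doublethink method itself. It assumes Bonferroni local tests on every intersection hypothesis, a choice made here purely for illustration; the paper instead uses Bayesian model-averaged tests as the local tests. An elementary hypothesis is rejected only when every intersection hypothesis containing it is rejected, which is what gives strong-sense FWER control. With Bonferroni local tests the closure is known to reproduce Holm's step-down procedure, so `holm` is included as a cross-check. Both function names are hypothetical.

```python
from itertools import combinations


def closed_testing(p_values, alpha=0.05):
    """Indices of elementary hypotheses rejected by a closed testing procedure.

    Assumption: the local test of an intersection hypothesis over `subset`
    is Bonferroni, i.e. reject when min p-value <= alpha / len(subset).
    Enumerates all 2^m - 1 intersections, so it suits only small m.
    """
    m = len(p_values)

    def local_reject(subset):
        return min(p_values[i] for i in subset) <= alpha / len(subset)

    # Evaluate the local test on every non-empty subset of hypotheses.
    local = {
        subset: local_reject(subset)
        for k in range(1, m + 1)
        for subset in combinations(range(m), k)
    }
    # Reject hypothesis i only if all intersections containing i are rejected.
    return [
        i for i in range(m)
        if all(rej for subset, rej in local.items() if i in subset)
    ]


def holm(p_values, alpha=0.05):
    """Holm's step-down procedure, the known shortcut for the closure above."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = []
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            rejected.append(i)
        else:
            break
    return sorted(rejected)


if __name__ == "__main__":
    p = [0.01, 0.04, 0.03, 0.005]
    print(closed_testing(p), holm(p))
```

The brute-force enumeration is exponential in the number of hypotheses; a shortcut procedure avoids it by exploiting structure in the local tests, as Holm's procedure does for Bonferroni.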

Paper Structure (33 sections, 11 theorems, 260 equations, 7 figures, 2 tables)

Introduction
Bayesian hypothesis testing is a closed testing procedure that controls the familywise error rate in the strong sense
Frequentist false positive rate of a Bayesian model-averaged regression converges pointwise as the sample size grows
Strong-sense familywise error rate of a Bayesian model-averaged regression converges pointwise as the sample size grows
Inflation in a simplified two-variable model
Application to Mendelian randomization study of age-related macular degeneration
Inflation between highly correlated variables: simulation approach
Comparison to related approaches: simulations
Discussion
Data Availability Statement
Acknowledgements and Funding
Regularity conditions
Background theory
Closed testing procedures control the familywise error rate in the strong sense
Likelihood assumptions for simultaneous Bayesian-frequentist hypothesis testing
...and 18 more sections

Key Result

Theorem 1

Bayesian hypothesis tests are a type of CTP known as a shortcut CTP. That is, $\phi_{\boldsymbol s}({\boldsymbol y})=\psi_{\boldsymbol s}({\boldsymbol y})=1$ (rejection of $\theta\in\omega_{\boldsymbol s}$) automatically implies $\phi_{\boldsymbol r}({\boldsymbol y})=\psi_{\boldsymbol r}({\boldsymbo

Figures (7)

Figure 1: Inflation in the simplified two-variable model testing $\beta_1=\beta_2=0$ (A) and $\beta_1=0$ (B). A: FPR as a function of sample size: simulations (black line) and Theorem \ref{theorem_fpr} (green line). Assuming $\mu=1$, $h=1$, $\tau=9$, $\rho=0$. B: FPR as a function of $\beta_2$ for $\rho\in[0.0, 1.0]$ (shaded grey lines, labelled by $\rho$) and Theorem \ref{theorem_fpr} (green dashed line). Assuming $n=145$, $\mu=0.1$, $h=1$, $\sigma=1$, $\tau=9$. Each panel is based on 10 million simulations.
Figure 2: Inflation in simulations based on the AMD Mendelian randomization example with $\nu=15$ variables and $n=145$ (A) or $n=14~500$ (B). BFWER as a function of $\rho$ (black points), versus AFWER (grey points) and Theorem 3 (green line). Assuming $\mu=0.1$, $h=1$, $\tau=9$. Error bars based on $50~000$ (A) and $500~000$ (B) simulations.
Figure 3: Out-of-sample prediction error in $10~000$ simulations from the Doublethink prior with $\mu=0.01$ and $h=1$. The key applies to all figures in this section.
Figure 4: Estimator error (A) and standard error coverage (B) in $10~000$ simulations from the Doublethink prior with $\mu=0.01$ and $h=1$. Expected coverage (black line) is shown, with allowance for Monte Carlo error (grey lines; 95% confidence interval).
Figure 5: Type I Bayes FWER (A) and type II strikeout rate (B) for marginal tests of the significance of individual variables in $10~000$ simulations from the Doublethink prior with $\mu=0.01$ and $h=1$. Expected type I BFWER (black line) is shown, with allowance for Monte Carlo error (grey lines; 95% confidence interval).
...and 2 more figures

Theorems & Definitions (37)

Definition 1: Frequentist familywise error rate control
Definition 2: Bayesian false discovery rate control
Theorem 1: Bayesian hypothesis tests simultaneously control the Bayesian FDR and the frequentist FWER
proof
Definition 3: Regression problem
Definition 4: Likelihood ratio test; LRT
Example 1: FWER control of the regression problem
Definition 5: Posterior odds via the Bayesian information criterion; BIC
Definition 6: Joint Bayesian-frequentist test: Johnson model
Definition 7: Bayesian model-averaged hypothesis testing: Doublethink model
...and 27 more
