Doublethink: simultaneous Bayesian-frequentist model-averaged hypothesis testing

Helen R. Fryer, Nicolas Arning, Daniel J. Wilson

TL;DR

The paper addresses the tension between Bayesian model averaging and classical hypothesis testing by showing that Bayesian model-averaged tests form a closed testing procedure that strongly controls the frequentist familywise error rate (FWER). It derives a chi-squared tail approximation for the model-averaged posterior odds, which yields asymptotic p-values and allows conversion to and from frequentist test statistics via the model-averaged deviance. Using theoretical results, extensive simulations and a Mendelian randomization study of AMD, it shows that testing groups of correlated variables improves error control and power, and it documents finite-sample inflation along with grouping-based mitigation strategies. The resulting framework bridges the Bayesian false discovery rate (FDR) and the frequentist FWER, supporting post-hoc variable grouping, multilevel testing and interpretability in high-dimensional settings, with implications for hypothesis testing practice.

Abstract

Establishing the frequentist properties of Bayesian approaches widens their appeal and offers new understanding. In hypothesis testing, Bayesian model averaging addresses the problem that conclusions are sensitive to variable selection. But Bayesian false discovery rate (FDR) guarantees are sensitive to subjective prior assumptions. Here we show that Bayesian model-averaged hypothesis testing is a closed testing procedure that controls the frequentist familywise error rate (FWER) in the strong sense. To quantify the FWER, we use the theory of regular variation and likelihood asymptotics to derive a chi-squared tail approximation for the model-averaged posterior odds. Convergence is pointwise as the sample size grows and, in a simplified setting subject to a minimum effect size assumption, uniform. The 'Doublethink' method computes simultaneous posterior odds and asymptotic p-values for model-averaged hypothesis testing. We explore Doublethink through a Mendelian randomization study and simulations, comparing to approaches like LASSO, stepwise regression, the Benjamini-Hochberg procedure, the harmonic mean p-value and e-values. We consider the limitations of the approach, including finite-sample inflation, and mitigations, like testing groups of correlated variables. We discuss the benefits of Doublethink, including post-hoc variable selection, and its wider implications for the theory and practice of hypothesis testing.

Paper Structure

This paper contains 33 sections, 11 theorems, 260 equations, 7 figures, 2 tables.

Key Result

Theorem 1

Bayesian hypothesis tests are a type of closed testing procedure (CTP) known as a shortcut CTP. That is, $\phi_{\boldsymbol s}({\boldsymbol y})=\psi_{\boldsymbol s}({\boldsymbol y})=1$ (rejection of $\theta\in\omega_{\boldsymbol s}$) automatically implies $\phi_{\boldsymbol r}({\boldsymbol y})=\psi_{\boldsymbol r}({\boldsymbol y})$ ...
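
To make the closed testing idea concrete, the following is a minimal, generic sketch of a closed testing procedure, not the paper's Bayesian test. It assumes Bonferroni local tests on ordinary p-values (an illustrative choice), and the function name `closed_testing` is hypothetical. An elementary hypothesis is rejected only if every intersection hypothesis containing it is rejected by its local test; a shortcut CTP is one where this exhaustive enumeration can be avoided.

```python
from itertools import combinations


def closed_testing(p_values, alpha=0.05):
    """Indices of elementary hypotheses rejected by a closed testing procedure.

    Assumption for illustration: each intersection hypothesis is tested
    locally with a Bonferroni test, i.e. rejected when its smallest
    p-value is at most alpha divided by the number of hypotheses in it.
    """
    m = len(p_values)
    indices = list(range(m))

    def local_reject(subset):
        # Bonferroni local test of the intersection hypothesis over `subset`.
        return min(p_values[i] for i in subset) <= alpha / len(subset)

    rejected = []
    for i in indices:
        others = [j for j in indices if j != i]
        # H_i is rejected only if every intersection containing i is rejected.
        if all(
            local_reject((i,) + extra)
            for k in range(m)
            for extra in combinations(others, k)
        ):
            rejected.append(i)
    return rejected
```

With Bonferroni local tests this brute-force enumeration reproduces Holm's step-down procedure, which is itself a classic shortcut. The enumeration costs on the order of 2^m local tests, so it is only practical for small m.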

Figures (7)

  • Figure 1: Inflation in the simplified two-variable model testing $\beta_1=\beta_2=0$ (A) and $\beta_1=0$ (B). A: FPR as a function of sample size: simulations (black line) and Theorem \ref{theorem_fpr} (green line). Assuming $\mu=1$, $h=1$, $\tau=9$, $\rho=0$. B: FPR as a function of $\beta_2$ for $\rho\in[0.0, 1.0]$ (shaded grey lines, labelled by $\rho$) and Theorem \ref{theorem_fpr} (green dashed line). Assuming $n=145$, $\mu=0.1$, $h=1$, $\sigma=1$, $\tau=9$. Each panel is based on 10 million simulations.
  • Figure 2: Inflation in simulations based on the AMD Mendelian randomization example with $\nu=15$ variables and $n=145$ (A) or $n=14~500$ (B). BFWER as a function of $\rho$ (black points), versus AFWER (grey points) and Theorem 3 (green line). Assuming $\mu=0.1$, $h=1$, $\tau=9$. Error bars based on $50~000$ (A) and $500~000$ (B) simulations.
  • Figure 3: Out-of-sample prediction error in $10~000$ simulations from the Doublethink prior with $\mu=0.01$ and $h=1$. The key applies to all figures in this section.
  • Figure 4: Estimator error (A) and standard error coverage (B) in $10~000$ simulations from the Doublethink prior with $\mu=0.01$ and $h=1$. Expected coverage (black line) is shown, with allowance for Monte Carlo error (grey lines; 95% confidence interval).
  • Figure 5: Type I Bayes FWER (A) and type II strikeout rate (B) for marginal tests of the significance of individual variables in $10~000$ simulations from the Doublethink prior with $\mu=0.01$ and $h=1$. Expected type I BFWER (black line) is shown, with allowance for Monte Carlo error (grey lines; 95% confidence interval).
  • ...and 2 more figures

Theorems & Definitions (37)

  • Definition 1: Frequentist familywise error rate control
  • Definition 2: Bayesian false discovery rate control
  • Theorem 1: Bayesian hypothesis tests simultaneously control the Bayesian FDR and the frequentist FWER
  • proof
  • Definition 3: Regression problem
  • Definition 4: Likelihood ratio test; LRT
  • Example 1: FWER control of the regression problem
  • Definition 5: Posterior odds via the Bayesian information criterion; BIC
  • Definition 6: Joint Bayesian-frequentist test: Johnson model
  • Definition 7: Bayesian model-averaged hypothesis testing: Doublethink model
  • ...and 27 more
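
As a rough illustration of how a deviance can be read both as a frequentist p-value (Definition 4) and as approximate posterior odds (Definition 5), here is a minimal sketch using only the Python standard library. It assumes the textbook forms: a likelihood ratio test with one degree of freedom, and the standard Schwarz (BIC) approximation to the Bayes factor. The paper's own definitions, including its model-averaged versions and prior parameters, may differ. The function names and arguments are hypothetical.

```python
import math


def lrt_p_value_1df(deviance):
    """Asymptotic LRT p-value for one extra parameter.

    Uses the chi-squared tail with 1 degree of freedom:
    P(chi2_1 > x) = erfc(sqrt(x / 2)).
    """
    if deviance < 0:
        raise ValueError("deviance must be non-negative")
    return math.erfc(math.sqrt(deviance / 2.0))


def bic_posterior_odds(deviance, n, extra_params=1, prior_odds=1.0):
    """Approximate posterior odds for the larger model versus the smaller one.

    Assumption: the standard BIC approximation,
    Bayes factor ~ exp(deviance / 2) * n ** (-extra_params / 2),
    where deviance = 2 * (loglik_larger - loglik_smaller).
    """
    bayes_factor = math.exp(deviance / 2.0) * n ** (-extra_params / 2.0)
    return prior_odds * bayes_factor


# Example: the same deviance viewed both ways (illustrative numbers only).
deviance = 6.0
p = lrt_p_value_1df(deviance)
odds = bic_posterior_odds(deviance, n=145, extra_params=1)
```

Under these assumptions the posterior odds and the p-value are both monotone functions of the deviance, so a threshold on one corresponds to a threshold on the other for fixed sample size and prior odds.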