Adaptive Nonparametric Perturbations of Parametric Models with Generalized Bayes

Bohan Wu, Eli N. Weinstein, Sohrab Salehi, Yixin Wang, David M. Blei

Abstract

Parametric Bayesian modeling offers a powerful and flexible toolbox for machine learning. Yet the model, however detailed, may still be wrong, and this can make inferences untrustworthy. In this paper we introduce a new class of semiparametric corrections for parametric Bayesian models, when the target of inference is a functional of the true data distribution. Our starting point is a fully Bayesian modeling approach, which explicitly accounts for the possibility that the parametric model is wrong. Asymptotic analysis shows that this approach is both robust to model misspecification and data-efficient, achieving fast convergence when the parametric model is close to true. However, the fully Bayesian approach is limited in its practical usefulness by the challenges of conducting inference and computing a Bayes factor for a nonparametric model. We therefore propose a novel model correction based on generalized Bayes, which entirely avoids the need to compute a nonparametric Bayes factor, but preserves the robustness and efficiency of the fully Bayesian approach. We demonstrate our method by estimating causal effects of gene expression from single-cell RNA sequencing data. Overall, we offer a new efficient approach to robust Bayesian inference with parametric models.

Paper Structure

This paper contains 57 sections, 22 theorems, 141 equations, 13 figures.

Key Result

Proposition 1

Assume the marginal density $\mathrm{p}_{\mathrm{pm}}(x_{1:n}) := \int_\Theta \mathrm{p}_\theta(x_{1:n}) \, d\Pi_{\mathrm{pm}}(\theta)$ is well-defined. Under Assumptions (prior), (entropy), and (rate-diff) given in Section (BNP), …

Figures (13)

  • Figure 1: Synthetic data (generalized) Bayes factors. The log Bayes factor and log generalized Bayes factor comparing the parametric model to a nonparametric alternative, where positive values indicate the parametric model is favored. (a) NPP model with a Polya Tree. (b,c,d) gNPP models with the MMD, KSD and Wasserstein.
  • Figure 2: Synthetic data results. (a,b) KL divergence between the true data density and the posterior predictive of each model. (c,d) Absolute difference between the posterior mean estimate of the median and the true median, for each model, using the MMD gNPP. (e,f) Calibration of the MMD gNPP. We plot how often, across independent simulations, the posterior credible interval covers the true median. The nominal coverage is 90% (dashed).
  • Figure 3: Causal model. To analyze the effects of gene expression, we assume this causal graphical model. (a) The initial model where all variables are observed. (b) The model after intervening on the treatment variable $A$. The goal is to estimate the effect of the treatment gene on the outcome gene, as highlighted in blue.
  • Figure 4: Effect of FOXP3 on GZMH. (a) Posterior probability of the ATE being positive under the parametric, nonparametric, and gNPP models. $n$ denotes the size of the (subsampled) dataset. Values are the estimated median from 10 independent data subsamples and model samples. (b) Generalized mixing weights, $\hat{\eta}_n$. The estimated confidence interval (CI) is across independent data subsamples and model samples.
  • Figure 5: Effect of FOXP3 on GZMK. (a) Posterior probability of the ATE being positive under the parametric, nonparametric, and gNPP models. $n$ denotes the size of the (subsampled) dataset. (b) Generalized mixing weights, $\hat{\eta}_n$. The CI is across independent data subsamples and model samples.
  • ...and 8 more figures
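
The calibration check described in the Figure 2 caption (how often a posterior credible interval covers the true median across independent simulations, against a nominal 90%) can be illustrated with a minimal sketch. This is not the authors' code: the function names `equal_tailed_interval` and `empirical_coverage`, the equal-tailed interval construction, and the nearest-rank quantile rule are all assumptions made for illustration.

```python
def equal_tailed_interval(draws, level=0.90):
    """Equal-tailed interval from posterior draws (hypothetical helper).

    Uses a simple nearest-rank rule on the sorted draws; other quantile
    conventions would give slightly different endpoints.
    """
    if not draws:
        raise ValueError("need at least one posterior draw")
    ordered = sorted(draws)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lo_idx = round(tail * (n - 1))
    hi_idx = round((1.0 - tail) * (n - 1))
    return ordered[lo_idx], ordered[hi_idx]


def empirical_coverage(intervals, true_value):
    """Fraction of (lower, upper) intervals that contain true_value.

    Each interval is assumed to come from one independent simulation.
    """
    if not intervals:
        raise ValueError("need at least one interval")
    hits = sum(1 for lower, upper in intervals if lower <= true_value <= upper)
    return hits / len(intervals)


if __name__ == "__main__":
    # Toy intervals, one per simulated dataset; the true median is 0.5 here.
    toy_intervals = [(0.0, 1.0), (2.0, 3.0), (-1.0, 0.5), (0.4, 0.6)]
    print(empirical_coverage(toy_intervals, true_value=0.5))
```

A well-calibrated 90% interval would give an empirical coverage near 0.90 once the number of simulations is large; the toy input above is only meant to show the calling convention.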

Theorems & Definitions (31)

  • Remark 1: Choice of rate
  • Proposition 1: $\eta_n$ is consistent for model selection
  • Example 1: Dirichlet process perturbations are not consistent
  • Example 2: Dirichlet process normal mixture perturbations are consistent
  • Theorem 1: NPP models are efficient and robust
  • Theorem 2: The posterior expected empirical divergence converges at a rate $r_{m,n} \lor n^{-1}$
  • Theorem 3: $\hat{\eta}_n$ is consistent for model selection
  • Theorem 4: gNPP approximations are efficient and robust
  • Definition 5: KL divergence
  • Definition 6: Hellinger distance
  • ...and 21 more