Untangling Sample and Population Level Estimands in Bayesian Causal Inference

Arman Oganisian

Abstract

Model-based Bayesian inference for causal estimands has been growing in popularity; however, many misconceptions and implementation errors arise from conflating sample- and population-level estimands. Our goal is to elucidate the crucial differences between sample- and population-level inference across identification, modeling, computation, and interpretation. For example, common sample-level estimands require cross-world Bayesian modeling, whereas many (but not all) population-level estimands do not. Similarly, the former require explicit imputation of counterfactuals from their joint posterior, whereas the latter typically require only a posterior distribution over parameters and perhaps post-hoc Monte Carlo simulation. We provide a total of four examples with a particular emphasis on cross-world assumptions and Bayesian nonparametric methods. Because the differences are conceptually subtle but can be practically substantial, each example is discussed in detail with implementation code in Stan. We also provide a detailed discussion of common errors when implementing the Bayesian g-formula. The overarching message here is to always engage in first-principles thinking about which marginal of the joint posterior is of interest in a particular causal analysis, then follow the strict logic of Bayes' theorem and probability to avoid common implementation errors.

Paper Structure

This paper contains 21 sections, 34 equations, 5 figures.

Figures (5)

  • Figure 1: Posterior estimates produced using Stan as described in Appendix Section \ref{app:stan}. Left: boxplot of posterior draws from the distribution of the PATE, $\Psi$, and the SATE, $\theta$. The posterior distributions have the same center, but posterior uncertainty for the PATE is larger. Right: posterior mean (points) and 95% credible intervals (segments) of each subject's ITE, $\theta_i$, and the CATE evaluated at each $l_i$, $\Psi(l_i)$, for $i=1,2,\dots, 30$. We omit the last 20 subjects to avoid compressing the plot. The point estimates are similar, but the credible intervals for the ITE are wider than those for the CATE.
  • Figure 2: Posterior draws of the PATE under different covariate models. Left: the "exact" draws computed under draws of the true covariate model's parameters, $\Psi^{(t)} = ( (\beta_{01}^{(t)} - \beta_{00}^{(t)}) + ( \beta_{11}^{(t)} - \beta_{10}^{(t)}) ) \eta^{(t)}$. Middle: the BB draws, $\Psi_{BB}^{(t)} = \sum_{i=1}^n \psi^{(t)}(l_i) \phi_{Li}^{(t)}$. Right: the MATE draws computed with the empirical distribution held fixed, $\Psi^{(t)}_{MATE} = \frac{1}{n} \sum_{i=1}^n \psi^{(t)}(l_i)$. The spread of the BB draws is closer to that of the exact PATE draws, but all three sets of draws share a similar center.
  • Figure 3: Results from the truncated DPM discussed in Example 1. Top left: plot of the posterior regression function (mean in bold line with some draws in faded lines) of $Y$ on $L$ for each $a\in\{0,1\}$ against the training data. Top right: plot of the posterior CATE function $\psi(l)$ with the posterior mean in bold and some realizations in faded blue lines. Bottom left: posterior mean and 95% credible interval for each ITE. Bottom right: posterior mean density function (bold) along with some density function draws (in faded lines) against the observed data (gray bars).
  • Figure 4: Explanation of the MC procedure using synthetic data and the truncated DPM model. Left: For $\tau=-.5$, posterior draws of $\psi_1(\phi_{Y}) = P(Y(1) > \tau, Y(0) >\tau \mid L=l)$ at a selected grid of $l$ points, with the posterior mean at each $l$ in bold. Each draw is obtained via MC integration. For instance, the red point at $L=1.3$ is the posterior draw at iteration $t=706$. Right: this red point was obtained by simulating $B=500$ draws $[y^{(b)}(0),y^{(b)}(1)]$ from their joint distribution with the $t^{th}$ parameter draws plugged in. These are visualized as points. The region of integration $[y(1)>\tau, y(0)>\tau]$ is gray, and 19.8% of the $B$ draws fall in this region.
  • Figure 5: Posterior inference for unit $i=1$'s potential outcome curve across $K=10$ possible treatments using synthetic data under the $K$-variate normal model described in Example 3. Note that $a_1=2$, so $Y_1(2)$ is observed and plotted as a bold red dot; there is no posterior uncertainty about $Y_1(2)$ since posterior inference is conditional on the observed data, which include $Y_1(2)$. In black are the posterior mean and 95% credible intervals for $\{Y_1(a_1): a_1\in\mathcal{A}\}$, while the faded red line is the true curve. In blue we show the posterior mean and intervals for the population-level analogue $\{ E[Y(a) \mid L=l_i]: a \in\mathcal{A}\}$. The left panel presents inference under $\rho=.9$ and the right panel presents inference under $\rho=0$.
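
The contrast between the middle and right panels of Figure 2 can be illustrated with a small plain-Python sketch. This is not the paper's Stan code. It assumes that "BB" denotes the Bayesian bootstrap with Dirichlet(1, ..., 1) weights, that the CATE is linear in $l$, and that the covariates and "posterior" parameter draws are simulated stand-ins; the function names `dirichlet_uniform` and `pate_draws` are hypothetical.

```python
import random
import statistics


def dirichlet_uniform(n, rng):
    """Draw Dirichlet(1, ..., 1) weights by normalizing Exp(1) variates."""
    g = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(g)
    return [x / total for x in g]


def pate_draws(l, intercept_draws, slope_draws, rng):
    """Return (bb, mate): one PATE draw of each kind per posterior iteration."""
    n = len(l)
    bb, mate = [], []
    for b0, b1 in zip(intercept_draws, slope_draws):
        # psi^{(t)}(l_i): CATE at each observed covariate (assumed linear form)
        psi = [b0 + b1 * li for li in l]
        # phi_L^{(t)}: fresh Bayesian bootstrap weights at every iteration
        phi = dirichlet_uniform(n, rng)
        bb.append(sum(p * w for p, w in zip(psi, phi)))
        # MATE: empirical covariate distribution held fixed at weights 1/n
        mate.append(sum(psi) / n)
    return bb, mate


rng = random.Random(0)
l = [rng.gauss(0.0, 1.0) for _ in range(50)]      # simulated covariates
b0 = [rng.gauss(1.0, 0.1) for _ in range(2000)]   # stand-in posterior draws
b1 = [rng.gauss(0.5, 0.1) for _ in range(2000)]   # stand-in posterior draws
bb, mate = pate_draws(l, b0, b1, rng)

print("centers:", statistics.mean(bb), statistics.mean(mate))
print("spreads:", statistics.stdev(bb), statistics.stdev(mate))
```

Under these assumptions the two sets of draws share a center, while the BB draws are more dispersed because they also propagate uncertainty about the covariate distribution.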