
Large-Scale Bayesian Causal Discovery with Interventional Data

Seong Woo Han, Daniel Duy Vo, Brielin C. Brown

TL;DR

IBCD tackles large-scale causal discovery with interventional data by reducing the data to total-effect summaries and using a matrix-normal likelihood for efficient, uncertainty-aware inference. A hybrid empirical Bayes prior combines data-driven global sparsity (via ER/SF priors) with edge-specific weights learned from observational covariances, while a non-centered horseshoe prior enables sparse, interpretable edge identification. The approach yields calibrated edge inclusion probabilities and demonstrates superior structure recovery on synthetic data and robust, reproducible results on Perturb-seq datasets, with SF priors improving cross-fold and cross-dataset stability. Collectively, IBCD offers scalable, uncertainty-aware causal graph learning suitable for genome-scale perturbation data and beyond, while outlining avenues for non-linear extensions and richer posterior analyses.

Abstract

Inferring the causal relationships among a set of variables in the form of a directed acyclic graph (DAG) is an important but notoriously challenging problem. Recently, advancements in high-throughput genomic perturbation screens have inspired the development of methods that leverage interventional data to improve model identification. However, existing methods still perform poorly on large-scale tasks and fail to quantify uncertainty. Here, we propose Interventional Bayesian Causal Discovery (IBCD), an empirical Bayesian framework for causal discovery with interventional data. Rather than modeling the full data matrix, our approach models the likelihood of the matrix of total causal effects, which can be approximated by a matrix normal distribution. We place a spike-and-slab horseshoe prior on the edges and separately learn data-driven weights for scale-free and Erdős-Rényi structures from observational data, treating each edge as a latent variable to enable uncertainty-aware inference. Through extensive simulation, we show that IBCD achieves superior structure recovery compared to existing baselines. We apply IBCD to CRISPR perturbation (Perturb-seq) data on 521 genes, demonstrating that edge posterior inclusion probabilities enable identification of robust graph structures.
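To make the notion of a "total causal effect" concrete, here is a minimal sketch of how direct edge weights accumulate into total effects along directed paths. It assumes the standard linear structural equation convention, in which `G[i][j]` is the direct effect of variable `i` on variable `j` and the total effect is the sum of path products, $R = G + G^2 + \dots + G^{D-1}$ for an acyclic $G$. This convention, the helper names `matmul` and `total_effects`, and the toy weights are illustrative assumptions; the paper's own mapping from $G$ to $R$ may be normalized differently.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def total_effects(g):
    """Sum of path products G + G^2 + ... + G^(D-1).

    Assumes g is the weighted adjacency matrix of a DAG, so every power
    of g beyond D-1 is zero and the sum is finite.
    """
    d = len(g)
    total = [row[:] for row in g]   # running sum, starts at G
    power = [row[:] for row in g]   # current power of G
    for _ in range(d - 2):
        power = matmul(power, g)
        total = [[total[i][j] + power[i][j] for j in range(d)]
                 for i in range(d)]
    return total


# Hypothetical 3-variable graph: 0 -> 1 -> 2 plus a direct edge 0 -> 2.
G = [[0.0, 0.5, 0.1],
     [0.0, 0.0, 2.0],
     [0.0, 0.0, 0.0]]
R = total_effects(G)
```

In this toy graph the total effect of variable 0 on variable 2 combines the direct edge (0.1) with the indirect path through variable 1 (0.5 × 2.0), which is why a matrix of total effects alone does not directly reveal which entries correspond to edges.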

Paper Structure

This paper contains 31 sections, 33 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: The IBCD model. We treat $\hat{R}$ and $\hat{S}$ as observed and model $\hat{R}$ as coming from a matrix normal distribution $\mathcal{MN}(R,U,V)$ with $U$ and $V$ determined from $\hat{S}$. We place a spike-and-slab horseshoe prior on $G$, which determines the mean matrix $R$ via Eq. \ref{eq:r_g_inv}.
  • Figure 2: Comparison of F1 and SHD with increasing numbers of dimensions on ER and SF graphs, using 100 intervention samples per variable.
  • Figure 3: Comparison of F1 and SHD with increasing intervention sample sizes on ER and SF graphs with $D=50$.
  • Figure 4: Calibration of posterior inclusion probability (PIP) on the $D=500$ ER (left) and SF (right) graphs from Figure \ref{fig:f1_shd_d}. Bars show mean $\pm$ s.d. over ten replicates. True edges concentrate at high PIP, confirming reliable uncertainty quantification.
  • Figure 5: Agreement of log-transformed PIP across the ten cross-validation fold pairs on the K562 GWPS screen under ER and SF priors. Each point is one edge; the density of points forms a grey cloud. The positive correlation shows that high-PIP edges are consistently identified across folds.
  • ...and 6 more figures