Bivariate Causal Discovery using Bayesian Model Selection
Anish Dhir, Samuel Power, Mark van der Wilk
TL;DR
This work reframes bivariate causal discovery as Bayesian model selection. It relaxes strict identifiability constraints by encoding causal assumptions in priors and treating each causal direction as a competing Bayesian causal model (BCM). Using a Gaussian process latent variable model (GPLVM) to flexibly model the joint distribution, the approach can distinguish X → Y from X ← Y even when the two models are distribution-equivalent, so that likelihood alone cannot separate them; the asymmetry comes from the Independent Causal Mechanisms (ICM) principle, expressed through separable priors. The authors derive conditions under which marginal likelihoods discriminate between causal directions, provide a statistical test for the asymmetry, and analyze performance under model misspecification. Empirically, the GPLVM-based method outperforms methods that rely on restrictive identifiability assumptions, as well as other flexible baselines, on real and synthetic datasets, which highlights the practical value of Bayesian model selection with expressive priors for causal discovery. The work also outlines extensions to deeper models.
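To make the model-selection framing concrete, the comparison can be written as follows. The notation here is an assumption introduced for illustration, not taken from the summary: D denotes the observed pairs, theta parameterises the cause distribution, phi parameterises the mechanism, and pi denotes a prior. The factorised prior pi(theta) pi(phi) is one way to express the separable priors that the ICM principle motivates.

```latex
% Assumed notation: \mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N};
% \theta parameterises the cause, \phi the mechanism, \pi denotes a prior.
\begin{align}
p(\mathcal{D} \mid M_{X \to Y})
  &= \iint p(\mathbf{x} \mid \theta)\, p(\mathbf{y} \mid \mathbf{x}, \phi)\,
     \pi(\theta)\, \pi(\phi)\, d\theta\, d\phi, \\
\frac{p(M_{X \to Y} \mid \mathcal{D})}{p(M_{Y \to X} \mid \mathcal{D})}
  &= \frac{p(\mathcal{D} \mid M_{X \to Y})\, p(M_{X \to Y})}
          {p(\mathcal{D} \mid M_{Y \to X})\, p(M_{Y \to X})}.
\end{align}
```

With equal prior probabilities on the two directions, the posterior odds reduce to the ratio of marginal likelihoods, so the direction with the larger marginal likelihood is preferred.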
Abstract
Much of the causal discovery literature prioritises guaranteeing the identifiability of causal direction in statistical models. For structures within a Markov equivalence class, this requires strong assumptions which may not hold in real-world datasets, ultimately limiting the usability of these methods. Building on previous attempts, we show how to incorporate causal assumptions within the Bayesian framework. Identifying causal direction then becomes a Bayesian model selection problem. This enables us to construct models with realistic assumptions, and consequently allows for the differentiation between Markov equivalent causal structures. We analyse why Bayesian model selection works in situations where methods based on maximum likelihood fail. To demonstrate our approach, we construct a Bayesian non-parametric model that can flexibly model the joint distribution. We then outperform previous methods on a wide range of benchmark datasets with varying data generating assumptions.
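The idea of scoring each causal direction by its marginal likelihood can be sketched in a few lines of Python. This is a minimal illustration under strong simplifying assumptions, not the GPLVM model described above: the cause is given a fixed standard normal density, the mechanism is a zero-mean Gaussian process regression with an RBF kernel, and the kernel hyperparameters and noise variance are fixed by hand rather than learned. All function names are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)


def gp_log_marginal(inputs, targets, noise=0.1):
    """log p(targets | inputs) under a zero-mean GP prior on the function.

    `noise` is the observation noise variance (assumed fixed here).
    """
    n = len(inputs)
    K = rbf_kernel(inputs, inputs) + noise * np.eye(n)
    L = np.linalg.cholesky(K)
    # alpha = K^{-1} targets, computed via the Cholesky factor.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, targets))
    return (
        -0.5 * targets @ alpha
        - np.sum(np.log(np.diag(L)))
        - 0.5 * n * np.log(2 * np.pi)
    )


def std_normal_log_density(v):
    """log density of v under independent standard normals."""
    return -0.5 * np.sum(v**2) - 0.5 * len(v) * np.log(2 * np.pi)


def direction_scores(x, y):
    """Return (score for X -> Y, score for Y -> X) as log marginal likelihoods.

    Each score is log p(cause) + log p(effect | cause). After standardisation
    the Gaussian cause term is identical for both directions, so in this toy
    model the comparison is driven entirely by the conditional term.
    """
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    score_xy = std_normal_log_density(x) + gp_log_marginal(x, y)
    score_yx = std_normal_log_density(y) + gp_log_marginal(y, x)
    return score_xy, score_yx
```

Calling `direction_scores(x, y)` on two equal-length 1-D arrays returns the two log marginal likelihoods; the larger one indicates the direction this toy model prefers. Because the hyperparameters are fixed and the cause model is deliberately crude, the output should not be read as a reliable causal verdict.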
