Challenges and Considerations in the Evaluation of Bayesian Causal Discovery

Amir Mohammad Karimi Mamaghan, Panagiotis Tigas, Karl Henrik Johansson, Yarin Gal, Yashas Annadani, Stefan Bauer

TL;DR

This paper investigates how to evaluate Bayesian Causal Discovery (BCD) when the object of interest is the posterior over graphs and parameters. It shows that commonly used graph-centric metrics (e.g., $E$-SHD, $E$-CPDAG SHD, AUROC, AUPRC) often fail to reflect posterior quality in low-data or non-identifiable regimes, and that these metrics become less reliable as the entropy of the true posterior increases. Through an empirical study on linear additive noise models, the authors compare several BCD methods (e.g., BCD-Nets, DIBS, VBG, BGIES) against the true posterior using Graph MMD and Params MMD, which exposes both the remaining gap between each approximation and the true posterior and the entropy-driven limitations of the proxy metrics. They propose alternative evaluation procedures, including downstream tasks such as experimental design and causal effect estimation, for assessing BCD when the posterior is highly entropic, with practical implications for safe experimental planning and causal decision making.

Abstract

Representing uncertainty in causal discovery is a crucial component for experimental design, and more broadly, for safe and reliable causal decision making. Bayesian Causal Discovery (BCD) offers a principled approach to encapsulating this uncertainty. Unlike non-Bayesian causal discovery, which relies on a single estimated causal graph and model parameters for assessment, evaluating BCD presents challenges due to the nature of its inferred quantity: the posterior distribution. As a result, the research community has proposed various metrics to assess the quality of the approximate posterior. However, there is, to date, no consensus on the most suitable metric(s) for evaluation. In this work, we reexamine this question by dissecting various metrics and understanding their limitations. Through extensive empirical evaluation, we find that many existing metrics fail to exhibit a strong correlation with the quality of approximation to the true posterior, especially in scenarios with low sample sizes where BCD is most desirable. We highlight the suitability (or lack thereof) of these metrics under two distinct factors: the identifiability of the underlying causal model and the quantity of available data. Both factors affect the entropy of the true posterior, indicating that the current metrics are less fitting in settings of higher entropy. Our findings underline the importance of a more nuanced evaluation of new methods that takes into account the nature of the true posterior, and they guide and motivate the development of new evaluation procedures for this challenge.
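
To make the abstract's point about entropy concrete, here is a minimal sketch of an expected SHD ($E$-SHD) computation on a two-variable toy problem. It is not the authors' code. The function names, the convention that a reversed edge costs 1, and the assumption that the true posterior splits its mass equally across the two Markov-equivalent graphs are all illustrative assumptions.

```python
import numpy as np

def shd(a, b):
    """Structural Hamming distance between two DAG adjacency matrices.

    Assumed convention: each node pair that differs counts once,
    so a reversed edge costs 1.
    """
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    diff = a != b
    # A pair (i, j) differs if the entry differs in either direction.
    pair_diff = diff | diff.T
    return int(np.triu(pair_diff, k=1).sum())

def expected_shd(graphs, probs, true_graph):
    """Expectation of SHD to the true graph under a discrete graph posterior."""
    return float(sum(p * shd(g, true_graph) for g, p in zip(graphs, probs)))

# Toy setting: the data-generating graph is X -> Y.
g_true = np.array([[0, 1], [0, 0]])
g_rev = np.array([[0, 0], [1, 0]])

# Assumption: in the non-identifiable case the true posterior gives equal
# mass to the two Markov-equivalent graphs X -> Y and Y -> X.
true_posterior_eshd = expected_shd([g_true, g_rev], [0.5, 0.5], g_true)

# A hypothetical overconfident approximation: all mass on one graph,
# which here happens to be the data-generating one.
point_mass_eshd = expected_shd([g_true], [1.0], g_true)

print(true_posterior_eshd, point_mass_eshd)
```

Under these assumptions the true posterior receives a worse $E$-SHD than the overconfident approximation, even though the approximation misrepresents the uncertainty. This is the kind of mismatch the abstract attributes to high-entropy posteriors.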

Paper Structure (42 sections, 7 equations, 25 figures)


Figures (25)

  • Figure 1: Evaluation of the models on ER1 graphs in the non-identifiable case ($d=5$). In low-sample regimes, the true posterior itself scores worse on these metrics than its approximations.
  • Figure 2: Evaluation of the models on ER1 graphs in the identifiable case ($d=5$). In low-sample regimes, the true posterior itself scores worse on these metrics than its approximations.
  • Figure 3: Spearman's rank correlation coefficient between evaluation metrics with 5 samples ($d=5$). The first and second rows correspond to the non-identifiable and identifiable cases, respectively. None of the graph-based metrics is correlated with Graph MMD, and Params MMD is not correlated with any of the other metrics. Graph MMD and Params MMD are metrics that evaluate against the true posterior.
  • Figure 4: Spearman's rank correlation coefficient between evaluation metrics with 100 samples ($d=5$). The first and the second rows correspond to the non-identifiable and identifiable cases, respectively. All the graph-based metrics are correlated with each other and also the Graph MMD. Params MMD is also correlated with other metrics. Graph MMD and Params MMD are metrics that evaluate against the true posterior.
  • Figure 5: Graph MMD of the models on ER1 graphs ($d=5$).
  • ...and 20 more figures
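
The rank correlations described in the captions of Figures 3 and 4 compare how two metrics order the same set of methods. The sketch below shows one way such a coefficient could be computed. The `spearman` helper is a hypothetical implementation that assumes no tied scores, and the score arrays are made-up numbers chosen so that the two rankings are exactly reversed; they are not results from the paper.

```python
import numpy as np

def spearman(x, y):
    """Spearman's rank correlation for 1-D score arrays without ties.

    Ranks each array, then takes the Pearson correlation of the ranks.
    """
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

# Made-up scores for five hypothetical methods (lower is better for both).
e_shd = np.array([2.1, 3.4, 1.7, 4.0, 2.8])
graph_mmd = np.array([0.30, 0.12, 0.45, 0.08, 0.20])

rho = spearman(e_shd, graph_mmd)
print(rho)
```

A coefficient near 1 means the two metrics rank methods the same way, near 0 means the rankings are unrelated, and near -1 (as in this constructed example) means the method that looks best under one metric looks worst under the other.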