
Toward Falsifying Causal Graphs Using a Permutation-Based Test

Elias Eulig, Atalanti A. Mastakouri, Patrick Blöbaum, Michaela Hardt, Dominik Janzing

TL;DR

This work tackles the challenge of validating a user-provided causal DAG ${\hat{\mathcal{G}}}$ against observational data when the true graph ${\mathcal{G}}^{*}$ is unknown. It introduces a permutation-based baseline by randomly permuting graph nodes to create a null distribution of local Markov condition (LMC) violations, enabling a principled p-value $p_{\text{LMC}}$ and a related falsifiability metric $p_{\text{TPa}}$. The paper formalizes the baseline, proves uniform sampling from the DAG orbit, and demonstrates the method on synthetic and real datasets (including Sachs, Auto MPG, and AWS APM), showing that true graphs are not falsified by the metric while plausible incorrect graphs are often rejected. This approach provides a practical, interpretable benchmark for evaluating causal graphs in the absence of an established baseline, with implications for guiding causal discovery and domain-specific graph refinement.

Abstract

Understanding causal relationships among the variables of a system is paramount to explain and control its behavior. For many real-world systems, however, the true causal graph is not readily available and one must resort to predictions made by algorithms or domain experts. Therefore, metrics that quantitatively assess the goodness of a causal graph provide helpful checks before using it in downstream tasks. Existing metrics provide an $\textit{absolute}$ number of inconsistencies between the graph and the observed data, and without a baseline, practitioners are left to answer the hard question of how many such inconsistencies are acceptable or expected. Here, we propose a novel consistency metric by constructing a baseline through node permutations. By comparing the number of inconsistencies with those on the baseline, we derive an interpretable metric that captures whether the graph is significantly better than random. Evaluating on both simulated and real data sets from various domains, including biology and cloud monitoring, we demonstrate that the true graph is not falsified by our metric, whereas the wrong graphs given by a hypothetical user are likely to be falsified.
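
The permutation baseline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. The names `permute_nodes`, `permutation_p_value`, and the `count_violations` callback are hypothetical. The DAG is assumed to be stored as a mapping from each node to its set of parents, and the p-value is assumed to be the fraction of permuted graphs with at most as many violations as the given graph. In practice `count_violations` would run conditional independence tests on data; the toy stand-in below only counts edges that disagree with a fixed reference graph.

```python
import random
from typing import Callable, Dict, Hashable, Set

# Assumed representation: node -> set of parent nodes.
Dag = Dict[Hashable, Set[Hashable]]


def permute_nodes(dag: Dag, rng: random.Random) -> Dag:
    """Relabel the nodes of `dag` with a uniformly random permutation."""
    nodes = list(dag)
    shuffled = nodes[:]
    rng.shuffle(shuffled)
    relabel = dict(zip(nodes, shuffled))
    return {relabel[v]: {relabel[p] for p in parents} for v, parents in dag.items()}


def permutation_p_value(
    dag: Dag,
    count_violations: Callable[[Dag], int],
    n_permutations: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of node-permuted DAGs with at most as many violations as `dag`."""
    rng = random.Random(seed)
    observed = count_violations(dag)
    hits = sum(
        count_violations(permute_nodes(dag, rng)) <= observed
        for _ in range(n_permutations)
    )
    return hits / n_permutations


# Toy stand-in for a data-driven violation count (an assumption for this demo):
# count edges of the candidate graph that are absent from a reference chain.
REFERENCE_EDGES = {("A", "B"), ("B", "C"), ("C", "D")}


def toy_violations(dag: Dag) -> int:
    edges = {(p, v) for v, parents in dag.items() for p in parents}
    return len(edges - REFERENCE_EDGES)


chain: Dag = {"A": set(), "B": {"A"}, "C": {"B"}, "D": {"C"}}
p_value = permutation_p_value(chain, toy_violations)
print(f"permutation p-value for the reference chain: {p_value:.3f}")
```

A small p-value means that few random relabelings do as well as the given graph, so the graph is significantly better than the random baseline. Keeping the scoring function as a callback keeps the permutation logic independent of the choice of conditional independence test.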

Paper Structure

This paper contains 56 sections, 5 theorems, 17 equations, 13 figures, and 5 tables.

Key Result

Theorem 1

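A probability distribution $P$ is Markov relative to a DAG ${\mathcal{G}} = (\bm{V}, \mathcal{E})$ iff $X_i \perp\!\!\!\perp_P \text{ND}_{i}^{{\mathcal{G}}} \setminus \text{Pa}_{i}^{{\mathcal{G}}} \mid \text{Pa}_{i}^{{\mathcal{G}}}$ for all $X_i \in \bm{V}$, where $\text{ND}_{i}^{{\mathcal{G}}}$ and $\text{Pa}_{i}^{{\mathcal{G}}}$ denote the non-descendants and parents of $X_i$ in ${\mathcal{G}}$.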
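
To make the condition concrete, the following sketch lists the independence statements that Theorem 1 implies for a given DAG. It is an illustration under assumptions: the function names are hypothetical, the DAG is stored as a mapping from each node to its set of parents (every parent must also appear as a key), and statements with an empty non-descendant set are skipped because they are trivially true.

```python
from typing import Dict, FrozenSet, Hashable, List, Set, Tuple

Dag = Dict[Hashable, Set[Hashable]]  # assumed representation: node -> parents
Statement = Tuple[Hashable, FrozenSet[Hashable], FrozenSet[Hashable]]


def descendants(dag: Dag, node: Hashable) -> Set[Hashable]:
    """Return all nodes reachable from `node` along directed edges."""
    children: Dict[Hashable, Set[Hashable]] = {v: set() for v in dag}
    for v, parents in dag.items():
        for p in parents:
            children[p].add(v)
    seen: Set[Hashable] = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def local_markov_statements(dag: Dag) -> List[Statement]:
    """List triples (X_i, ND_i minus Pa_i, Pa_i) with a non-empty middle set."""
    statements: List[Statement] = []
    for node, parents in dag.items():
        non_descendants = set(dag) - descendants(dag, node) - {node}
        rest = non_descendants - parents
        if rest:
            statements.append((node, frozenset(rest), frozenset(parents)))
    return statements


# Chain A -> B -> C: the only nontrivial statement is "C independent of A given B".
chain3: Dag = {"A": set(), "B": {"A"}, "C": {"B"}}
for x, nd, pa in local_markov_statements(chain3):
    print(f"{x} independent of {sorted(nd)} given {sorted(pa)}")
```

Each returned triple corresponds to one conditional independence test on data; a rejected test counts as a violation of the local Markov condition for that node.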

Figures (13)

  • Figure 1: (a) For monitoring micro-service architectures, domain experts may invert the call graph to obtain a causal graph of latencies and error rates [Budhathoki2022]; (b) however, there are multiple reasons why this graph might violate independence statements on observed data.
  • Figure 2: Type I error rate at $\alpha=5\%$ for different sizes $D$ of the conditioning set for one parametric (partial correlation) and two nonparametric CI tests, KCI [zhang2011] and GCM [shah2020]. Data (solid: $N = 100$, dashed: $N = 500$) were sampled from Gaussian linear conditionals. More details are given in the supplement.
  • Figure 3: Mean $p_\text{LMC}$ for two types of domain experts, simulated via DE-$\bm{V}$ (left; smaller numbers correspond to less domain knowledge) and DE-$\mathcal{E}$ (right; smaller numbers correspond to more domain knowledge). On synthetic data, for the true DAG ($|K|/|\bm{V}| = 1$; $\text{SHD}/|\mathcal{E}|=0$), we reject the null that the DAG is as bad as random at $\alpha=1\%$ for all configurations. ${\hat{\mathcal{G}}}$ is falsified at the same $\alpha$ if $|K|/|\bm{V}|\leq 0.6$ or $\text{SHD}/|\mathcal{E}|\geq 1.5$. On real-world data, for the true DAG, we reject the null that it is as bad as random at $\alpha=5\%$, and ${\hat{\mathcal{G}}}$ is falsified at the same $\alpha$ for $|K|/|\bm{V}|\leq 0.8$ or $\text{SHD}/|\mathcal{E}|\geq 0.5$ for all datasets.
  • Figure A1: Fraction of LMC violations for increasing number of nodes / connectivity. We generated data with $N=400$ from ER-$n$-$d$ graphs with nonlinear conditionals. LMC violations were tested (with $\alpha=5\%$) using KCI. We applied either no correction, Bonferroni correction, or Benjamini-Hochberg correction.
  • Figure A2: Three panels: the true DAG ${\mathcal{G}}^{*}$; ${\hat{\mathcal{G}}}$ from a DE-$\bm{V}$, where $K = \{X_3, X_4\}$ and the remaining 3 edges are randomly shuffled; and ${\hat{\mathcal{G}}}$ from a DE-$\mathcal{E}$, where $N = \{(X_1, X_3)\}$, $M = \{(X_1, X_2)\}$, and $L=\{(X_2, X_3)\}$.
  • ...and 8 more figures
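
The Figure A1 caption compares no correction, Bonferroni correction, and Benjamini-Hochberg correction when testing LMC violations. The sketch below shows the two standard procedures on a list of p-values. The function names and the example p-values are illustrative assumptions, not taken from the paper, and a rejected test is assumed to count as one violation.

```python
from typing import List, Sequence


def bonferroni_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Reject test i if p_i <= alpha / m, where m is the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Reject the k smallest p-values, with k the largest rank such that
    p_(k) <= (k / m) * alpha (the step-up rule)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Made-up p-values from five hypothetical conditional independence tests.
example_p = [0.001, 0.012, 0.025, 0.045, 0.6]
for name, rule in [("Bonferroni", bonferroni_reject),
                   ("Benjamini-Hochberg", benjamini_hochberg_reject)]:
    rejected = rule(example_p)
    print(f"{name}: fraction of violations = {sum(rejected) / len(rejected):.2f}")
```

Bonferroni controls the family-wise error rate and is the more conservative of the two, so it typically reports fewer violations than Benjamini-Hochberg, which controls the false discovery rate.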

Theorems & Definitions (15)

  • Theorem 1: Parental Markov condition [pearl2009]
  • Definition 1: Parental triples
  • Definition 2: Violations of LMCs
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Theorem 2
  • proof
  • proof
  • proof
  • ...and 5 more