Toward Falsifying Causal Graphs Using a Permutation-Based Test
Elias Eulig, Atalanti A. Mastakouri, Patrick Blöbaum, Michaela Hardt, Dominik Janzing
TL;DR
This work tackles the challenge of validating a user-provided causal DAG ${\hat{\mathcal{G}}}$ against observational data when the true graph ${\mathcal{G}}^{*}$ is unknown. It introduces a permutation-based baseline by randomly permuting graph nodes to create a null distribution of local Markov condition (LMC) violations, enabling a principled p-value $p_{\text{LMC}}$ and a related falsifiability metric $p_{\text{TPa}}$. The paper formalizes the baseline, proves uniform sampling from the DAG orbit, and demonstrates the method on synthetic and real datasets (including Sachs, Auto MPG, and AWS APM), showing that true graphs are not falsified by the metric while plausible incorrect graphs are often rejected. This approach provides a practical, interpretable benchmark for evaluating causal graphs in the absence of an established baseline, with implications for guiding causal discovery and domain-specific graph refinement.
Abstract
Understanding causal relationships among the variables of a system is paramount to explaining and controlling its behavior. For many real-world systems, however, the true causal graph is not readily available and one must resort to predictions made by algorithms or domain experts. Therefore, metrics that quantitatively assess the goodness of a causal graph provide helpful checks before using it in downstream tasks. Existing metrics provide an $\textit{absolute}$ number of inconsistencies between the graph and the observed data, and without a baseline, practitioners are left to answer the hard question of how many such inconsistencies are acceptable or expected. Here, we propose a novel consistency metric by constructing a baseline through node permutations. By comparing the number of inconsistencies with those on the baseline, we derive an interpretable metric that captures whether the graph is significantly better than random. Evaluating on both simulated and real data sets from various domains, including biology and cloud monitoring, we demonstrate that the true graph is not falsified by our metric, whereas the wrong graphs given by a hypothetical user are likely to be falsified.
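The node-permutation baseline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the graph is given as a dictionary mapping each node to its parents, that a local Markov condition (LMC) violation is counted once per pair of a node and a non-descendant non-parent found dependent given the node's parents, and that the p-value is the fraction of permuted graphs with at most as many violations as the given graph. The names `count_lmc_violations`, `permutation_p_value`, and `toy_oracle` are hypothetical, and the oracle stands in for a statistical conditional-independence test on data.

```python
import itertools
import random


def descendants(parents, node):
    """Return all descendants of `node` in a DAG given as {child: [parents]}."""
    children = {v: [c for c, ps in parents.items() if v in ps] for v in parents}
    seen, stack = set(), [node]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def count_lmc_violations(parents, is_independent):
    """Count (node, non-descendant) pairs judged dependent given the node's parents."""
    violations = 0
    for node, pa in parents.items():
        excluded = descendants(parents, node) | set(pa) | {node}
        for other in parents:
            if other not in excluded and not is_independent(node, other, frozenset(pa)):
                violations += 1
    return violations


def permute_graph(parents, mapping):
    """Relabel every node of the DAG according to `mapping`."""
    return {mapping[v]: [mapping[p] for p in ps] for v, ps in parents.items()}


def permutation_p_value(parents, is_independent, n_permutations=None, seed=0):
    """Fraction of node permutations with at most as many violations as the given graph.

    Enumerates all permutations when `n_permutations` is None (small graphs only);
    otherwise draws that many uniformly random permutations.
    """
    nodes = sorted(parents)
    observed = count_lmc_violations(parents, is_independent)
    if n_permutations is None:
        perms = list(itertools.permutations(nodes))
    else:
        rng = random.Random(seed)
        perms = [rng.sample(nodes, len(nodes)) for _ in range(n_permutations)]
    as_good = sum(
        count_lmc_violations(permute_graph(parents, dict(zip(nodes, perm))), is_independent)
        <= observed
        for perm in perms
    )
    return as_good / len(perms)


def toy_oracle(x, y, given):
    # Hypothetical stand-in for a CI test on data: in a faithful chain
    # X -> Y -> Z, the only independence is X _||_ Z given {Y}.
    return {x, y} == {"X", "Z"} and given == frozenset({"Y"})


true_graph = {"X": [], "Y": ["X"], "Z": ["Y"]}   # X -> Y -> Z
wrong_graph = {"X": [], "Z": ["X"], "Y": ["Z"]}  # X -> Z -> Y

p_true = permutation_p_value(true_graph, toy_oracle)
p_wrong = permutation_p_value(wrong_graph, toy_oracle)
print(p_true, p_wrong)
```

In this three-node toy case the true chain obtains a p-value of 1/3 (only it and its reversal have no violations among the six relabelings), while the wrong chain obtains 1.0, meaning it is no better than a random relabeling. With so few nodes the smallest attainable p-value is too coarse for a significance threshold such as 0.05; the example only illustrates the mechanics.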
