Assumption violations in causal discovery and the robustness of score matching
Francesco Montagna, Atalanti A. Mastakouri, Elias Eulig, Nicoletta Noceti, Lorenzo Rosasco, Dominik Janzing, Bryon Aragam, Francesco Locatello
TL;DR
This work studies how robust causal discovery methods are when their core assumptions are violated in observational data. It benchmarks 11 algorithms at scale on synthetic data generated under diverse conditions, including misspecifications such as confounding, measurement error, and unfaithfulness, with $d\in\{5,10,20,50\}$ nodes and $n\in\{100,1000\}$ samples on both sparse and dense graphs. The key finding is that score-matching-based approaches (e.g., SCORE, NoGAM, DAS) are surprisingly robust at inferring the causal order and, combined with CAM-pruning, recover edges reasonably well even under several misspecifications, whereas other methods often degrade substantially. The paper also contributes a Python library for data generation and benchmarking, and highlights hyperparameter stability as an important dimension for evaluating causal discovery methods in practice.
Abstract
When domain knowledge is limited and experimentation is restricted by ethical, financial, or time constraints, practitioners turn to observational causal discovery methods to recover the causal structure by exploiting the statistical properties of their data. Because causal discovery without further assumptions is an ill-posed problem, each algorithm comes with its own set of usually untestable assumptions, some of which are hard to meet in real datasets. Motivated by these considerations, this paper extensively benchmarks the empirical performance of recent causal discovery methods on observational i.i.d. data generated under different background conditions, allowing for violations of the critical assumptions required by each selected approach. Our experimental findings show that score matching-based methods perform surprisingly well in these challenging scenarios, as measured by the false positive and false negative rates of the inferred graph, and we provide theoretical insights into their performance. This work is also the first effort to benchmark the stability of causal discovery algorithms with respect to the values of their hyperparameters. Finally, we hope this paper will set a new standard for the evaluation of causal discovery methods and serve as an accessible entry point for practitioners interested in the field, highlighting the empirical implications of different algorithm choices.
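To make the evaluation criterion concrete, the following is a minimal sketch of how false positive and false negative rates of an inferred directed graph might be computed from adjacency matrices. The function name `edge_error_rates`, the convention that entry `[i, j]` encodes an edge from node `i` to node `j`, and the choice of denominators (true non-edges for the false positive rate, true edges for the false negative rate) are assumptions made for illustration; they are not taken from the paper or its library, which may normalize differently.

```python
import numpy as np


def edge_error_rates(true_adj, pred_adj):
    """Return (FPR, FNR) over directed edges, ignoring the diagonal.

    Assumed convention (hypothetical, not from the paper):
    adj[i, j] != 0 means there is an edge i -> j.
    """
    true_adj = np.asarray(true_adj, dtype=bool)
    pred_adj = np.asarray(pred_adj, dtype=bool)
    if true_adj.shape != pred_adj.shape or true_adj.shape[0] != true_adj.shape[1]:
        raise ValueError("adjacency matrices must be square and of equal shape")

    # Self-loops are not valid edges in a DAG, so mask the diagonal out.
    off_diag = ~np.eye(true_adj.shape[0], dtype=bool)

    false_pos = np.sum(pred_adj & ~true_adj & off_diag)  # predicted, not true
    false_neg = np.sum(true_adj & ~pred_adj & off_diag)  # true, not predicted
    negatives = np.sum(~true_adj & off_diag)             # true non-edges
    positives = np.sum(true_adj & off_diag)              # true edges

    # Guard against empty denominators (e.g., an empty or complete graph).
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / positives if positives else 0.0
    return float(fpr), float(fnr)


# Toy example: the true graph is the chain 0 -> 1 -> 2.
true_graph = [[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]]
# The prediction keeps 0 -> 1, misses 1 -> 2, and adds a spurious 0 -> 2.
pred_graph = [[0, 1, 1],
              [0, 0, 0],
              [0, 0, 0]]

fpr, fnr = edge_error_rates(true_graph, pred_graph)
print(fpr, fnr)  # 0.25 0.5
```

In the toy example, one of the four true non-edges is predicted and one of the two true edges is missed. A predicted edge with the wrong orientation counts as both a false positive and a false negative under this convention.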
