Assumption violations in causal discovery and the robustness of score matching

Francesco Montagna; Atalanti A. Mastakouri; Elias Eulig; Nicoletta Noceti; Lorenzo Rosasco; Dominik Janzing; Bryon Aragam; Francesco Locatello

TL;DR

This work studies how robust causal-discovery methods are when their core assumptions are violated in observational data. It benchmarks 11 algorithms on synthetic data generated under diverse background conditions, including misspecifications such as confounding, measurement error, and unfaithfulness, with $d\in\{5,10,20,50\}$ nodes and $n\in\{100,1000\}$ samples, on both sparse and dense graphs. The key finding is that score-matching-based approaches (e.g., SCORE, NoGAM, DAS) are surprisingly robust at inferring the causal order and, with CAM-pruning, achieve reasonable edge recovery even under several misspecifications, whereas other methods often degrade substantially. The paper also contributes a Python library for data generation and benchmarking, and highlights hyperparameter stability as an important dimension for evaluating causal-discovery methods in practice.

Abstract

When domain knowledge is limited and experimentation is restricted by ethical, financial, or time constraints, practitioners turn to observational causal discovery methods to recover the causal structure, exploiting the statistical properties of their data. Because causal discovery without further assumptions is an ill-posed problem, each algorithm comes with its own set of usually untestable assumptions, some of which are hard to meet in real datasets. Motivated by these considerations, this paper extensively benchmarks the empirical performance of recent causal discovery methods on observational i.i.d. data generated under different background conditions, allowing for violations of the critical assumptions required by each selected approach. Our experimental findings show that score matching-based methods demonstrate surprising performance in the false positive and false negative rate of the inferred graph in these challenging scenarios, and we provide theoretical insights into their performance. This work is also the first effort to benchmark the stability of causal discovery algorithms with respect to the values of their hyperparameters. Finally, we hope this paper will set a new standard for the evaluation of causal discovery methods and can serve as an accessible entry point for practitioners interested in the field, highlighting the empirical implications of different algorithm choices.

Paper Structure (58 sections, 7 theorems, 30 equations, 20 figures, 2 tables, 1 algorithm)

Introduction
The causal model
Problem definition
Identifiable models
Experimental design
Datasets
Misspecified scenarios
Data generation
Methods
Deepdive on SCORE, NoGAM and DiffAN
DAS
Key experimental results and analysis
Can current methods infer causality when assumptions on the data are violated?
Discussion on PC and GES performance
Discussion on score matching robustness
...and 43 more sections

Key Result

Proposition 1

Let $\mathbf{X} \in \mathbb{R}^d$ be generated according to the post-nonlinear model (eq:pnl). Then the score function of a leaf node $X_l$ satisfies $s_l(\mathbf{X}) = \partial_l \log p_l(U_l)$.
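The proposition can be checked numerically in a simple special case. The sketch below is not the paper's post-nonlinear setting: it assumes a two-variable additive noise model with Gaussian noise (the outer nonlinearity taken to be the identity), where the leaf score reduces to $-U_2/\sigma^2$. The mechanism `sin`, the value of `SIGMA`, and all function names are illustrative assumptions.

```python
import numpy as np

SIGMA = 0.5  # assumed noise standard deviation of the leaf node

def log_joint(x1, x2):
    """Log-density of the toy model X1 ~ N(0, 1), X2 = sin(X1) + U2, U2 ~ N(0, SIGMA^2)."""
    log_p1 = -0.5 * x1**2 - 0.5 * np.log(2 * np.pi)
    u2 = x2 - np.sin(x1)  # noise term of the leaf, recovered from the sample
    log_p2 = -0.5 * (u2 / SIGMA) ** 2 - np.log(SIGMA) - 0.5 * np.log(2 * np.pi)
    return log_p1 + log_p2

def leaf_score_numeric(x1, x2, eps=1e-5):
    """Central finite difference of the joint log-density with respect to the leaf X2."""
    return (log_joint(x1, x2 + eps) - log_joint(x1, x2 - eps)) / (2 * eps)

def leaf_score_closed_form(x1, x2):
    """Derivative of the log noise density evaluated at U2: -U2 / SIGMA^2 for Gaussian noise."""
    u2 = x2 - np.sin(x1)
    return -u2 / SIGMA**2

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = np.sin(x1) + SIGMA * rng.normal(size=1000)

# Largest disagreement between the numerical score and the closed form.
max_gap = np.max(np.abs(leaf_score_numeric(x1, x2) - leaf_score_closed_form(x1, x2)))
```

In this toy case the score entry of the leaf depends on the data only through the leaf's noise term, which is the property the proposition states for the more general post-nonlinear model.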

Figures (20)

Figure 1: Experimental results on the misspecified scenarios. For each method, we also display the violin plot of its performance on the vanilla scenario with transparent color. F1 score (the higher the better) and FNR-$\hat{\pi}$ (the lower the better) are evaluated over $20$ seeds on Erdos-Renyi dense graphs with $20$ nodes (ER-20 dense). FNR-$\hat{\pi}$ is not computed for GES and PC, methods whose output is a CPDAG. Note that DirectLiNGAM performance is reported in Appendix \ref{app:non_gauss_exp}, on data with non-Gaussian noise terms.
Figure 2: Gaussian noise (left) transformed via random nonlinear functions (center) to non-Gaussian i.i.d. noise (right). Weights of the MLP are sampled from either (\ref{fig:random_dist_0.5}) $U(-0.5, 0.5)$, (\ref{fig:random_dist_1.5}) $U(-1.5, 1.5)$, or (\ref{fig:random_dist_3}) $U(-3.0, 3.0)$.
Figure 3: To evaluate the quality of the ordering inferred by GraN-DAG, we sample one topological order at random among those admitted by the adjacency matrix before the CAM-pruning step. In this figure, we compare the empirical FNR-$\hat{\pi}$ of an order sampled at random among those admitted by the output against the average FNR-$\hat{\pi}$ computed over the set of all orderings admitted by the output. Selecting an order at random gives an unbiased representation of the average order accuracy among those admitted by the GraN-DAG output before CAM-pruning. The violin plots refer to the FNR-$\hat{\pi}$ evaluated on ER graphs with $10$ nodes over $20$ different random seeds.
Figure 4: The violin plots represent the difference between the F1 score of a method running inference with hyperparameters optimized using the ground truth and the F1 score of the same method using default hyperparameter values. We denote this difference by $|\textnormal{f1}_\textnormal{diff}|$. For GES, the default is $\lambda=0.5$. For all the remaining methods, the default alpha threshold is $\alpha=0.05$. The violin plots refer to the inference performance on datasets and graphs generated according to $20$ different random seeds. Results are on data generated from the vanilla scenario, and we consider Erdos-Renyi graphs with the number of nodes in $\{5, 10, 20, 50\}$ in the dense and sparse settings.
Figure 5: F1 score and FNR-$\hat{\pi}$ on data generated with non-Gaussian distribution of the noise terms (cf. Figure \ref{fig:random_dist_3}). For each method, we also display the violin plot of its performance on the vanilla scenario with Gaussian noise terms, with transparent color. F1 score (the higher the better) and FNR-$\hat{\pi}$ (the lower the better) are evaluated over $20$ seeds on Erdos-Renyi dense graphs with $20$ nodes (ER-20 dense). FNR-$\hat{\pi}$ is not computed for GES and PC, methods whose output is a CPDAG.
...and 15 more figures
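The captions above report two metrics, F1 and FNR-$\hat{\pi}$. The sketch below shows one plausible way to compute them, under stated assumptions: FNR-$\hat{\pi}$ is read as the fraction of ground-truth edges that the inferred order rules out (the child is placed before its parent), and F1 is computed over directed edges. The function names `order_fnr` and `f1_directed` are hypothetical and are not taken from the paper's library.

```python
import numpy as np

def order_fnr(true_adj, order):
    """Fraction of true edges i -> j that `order` violates (j placed before i).

    `true_adj[i, j] == 1` encodes an edge i -> j; `order` lists nodes from
    sources to leaves.
    """
    order = np.asarray(order)
    position = np.empty(len(order), dtype=int)
    position[order] = np.arange(len(order))  # position[v] = rank of node v
    parents, children = np.nonzero(true_adj)
    if len(parents) == 0:
        return 0.0  # convention for an empty graph (assumption)
    return float(np.mean(position[parents] > position[children]))

def f1_directed(true_adj, pred_adj):
    """F1 score over directed edges of two binary adjacency matrices."""
    tp = int(np.sum((true_adj == 1) & (pred_adj == 1)))
    fp = int(np.sum((true_adj == 0) & (pred_adj == 1)))
    fn = int(np.sum((true_adj == 1) & (pred_adj == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0  # 0.0 when both graphs are empty (assumption)

# Ground truth: chain 0 -> 1 -> 2.
true_adj = np.zeros((3, 3), dtype=int)
true_adj[0, 1] = 1
true_adj[1, 2] = 1

# An inferred order that places node 2 before node 1, so edge 1 -> 2 cannot be recovered.
fnr = order_fnr(true_adj, [0, 2, 1])

# A predicted graph with one correct edge (0 -> 1) and one reversed edge (2 -> 1).
pred_adj = np.zeros((3, 3), dtype=int)
pred_adj[0, 1] = 1
pred_adj[2, 1] = 1
f1 = f1_directed(true_adj, pred_adj)
```

An order-level metric of this kind is only defined for methods that output a DAG or an ordering, which is consistent with the captions noting that it is not computed for GES and PC, whose output is a CPDAG.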

Theorems & Definitions (13)

Proposition 1
Definition 1
Proposition 2
Definition 2
Lemma 1: Lemma 1 of rolland22_score
Lemma 2: Lemma 1 of montagna23_nogam
Lemma 3: Lemma 1 of montagna23_das
proof
Lemma 4
proof
...and 3 more
