Assumption violations in causal discovery and the robustness of score matching

Francesco Montagna, Atalanti A. Mastakouri, Elias Eulig, Nicoletta Noceti, Lorenzo Rosasco, Dominik Janzing, Bryon Aragam, Francesco Locatello

TL;DR

This work studies how robust causal-discovery methods are when their core assumptions are violated in observational data. It conducts a large-scale benchmark of 11 algorithms across diverse synthetic scenarios, including misspecifications such as confounding, measurement error, and unfaithfulness, using graphs with $d\in\{5,10,20,50\}$ nodes and datasets of $n\in\{100,1000\}$ samples, in both sparse and dense settings. The key finding is that score-matching-based approaches (e.g., SCORE, NoGAM, DAS) are surprisingly robust at inferring the causal order and, with CAM-pruning, achieve reasonable edge recovery even under several misspecifications, whereas other methods often degrade substantially. The paper also contributes a Python library for data generation and benchmarking, and highlights hyperparameter stability as an important dimension for evaluating causal-discovery methods in practice.

Abstract

When domain knowledge is limited and experimentation is restricted by ethical, financial, or time constraints, practitioners turn to observational causal discovery methods to recover the causal structure, exploiting the statistical properties of their data. Because causal discovery without further assumptions is an ill-posed problem, each algorithm comes with its own set of usually untestable assumptions, some of which are hard to meet in real datasets. Motivated by these considerations, this paper extensively benchmarks the empirical performance of recent causal discovery methods on observational i.i.d. data generated under different background conditions, allowing for violations of the critical assumptions required by each selected approach. Our experimental findings show that score matching-based methods demonstrate surprising performance in the false positive and false negative rate of the inferred graph in these challenging scenarios, and we provide theoretical insights into their performance. This work is also the first effort to benchmark the stability of causal discovery algorithms with respect to the values of their hyperparameters. Finally, we hope this paper will set a new standard for the evaluation of causal discovery methods and can serve as an accessible entry point for practitioners interested in the field, highlighting the empirical implications of different algorithm choices.

Paper Structure (58 sections, 7 theorems, 30 equations, 20 figures, 2 tables, 1 algorithm)

Key Result

Proposition 1

Let $\mathbf{X} \in \mathbb{R}^d$ be generated according to the post-nonlinear model \eqref{eq:pnl}. Then, the score function of a leaf node $X_l$ satisfies $s_l(\mathbf{X}) = \partial_l \log p_l(U_l)$.
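The statement can be checked numerically in a simple special case. The sketch below is an illustration, not the paper's code: it assumes a two-node additive-noise model with Gaussian noise (a post-nonlinear model whose outer function is the identity), where the leaf score reduces to $-U_2/\sigma_2^2$. The mechanism `f`, the noise scales, and all function names are hypothetical choices made for this example.

```python
import numpy as np

# Illustrative noise standard deviations (assumed values, not from the paper).
SIGMA_1, SIGMA_2 = 1.0, 0.5


def f(x1):
    """Hypothetical nonlinear mechanism generating the leaf X2 from X1."""
    return np.sin(2.0 * x1)


def log_joint(x1, x2):
    """Log-density of (X1, X2) up to an additive constant,
    for X1 = U1 and X2 = f(X1) + U2 with independent Gaussian noise."""
    return -0.5 * (x1 / SIGMA_1) ** 2 - 0.5 * ((x2 - f(x1)) / SIGMA_2) ** 2


def leaf_score_numeric(x1, x2, eps=1e-5):
    """Central finite difference of the log-density with respect to the leaf x2."""
    return (log_joint(x1, x2 + eps) - log_joint(x1, x2 - eps)) / (2.0 * eps)


def leaf_score_closed_form(x1, x2):
    """Derivative of the Gaussian log noise density evaluated at the residual U2."""
    u2 = x2 - f(x1)
    return -u2 / SIGMA_2 ** 2


rng = np.random.default_rng(0)
x1 = rng.normal(0.0, SIGMA_1, size=1000)
x2 = f(x1) + rng.normal(0.0, SIGMA_2, size=1000)

# Largest disagreement between the numerical score and the closed form.
max_gap = np.max(np.abs(leaf_score_numeric(x1, x2) - leaf_score_closed_form(x1, x2)))
print(f"max |numeric - closed form| = {max_gap:.2e}")
```

In this Gaussian case the leaf score is linear in the residual, so its derivative with respect to $X_2$ is the constant $-1/\sigma_2^2$. A non-leaf node does not have this property in general, because its variable also appears inside the mechanisms of its children.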

Figures (20)

  • Figure 1: Experimental results on the misspecified scenarios. For each method, we also display the violin plot of its performance on the vanilla scenario in a transparent color. F1 score (higher is better) and FNR-$\hat{\pi}$ (lower is better) are evaluated over $20$ seeds on Erdős-Rényi dense graphs with $20$ nodes (ER-20 dense). FNR-$\hat{\pi}$ is not computed for GES and PC, whose output is a CPDAG. Note that DirectLiNGAM performance is reported in Appendix \ref{app:non_gauss_exp}, on data with non-Gaussian noise terms.
  • Figure 2: Gaussian noise (left) transformed via random nonlinear functions (center) to non-Gaussian i.i.d. noise (right). Weights of the MLP are sampled from either (\ref{fig:random_dist_0.5}) $U(-0.5, 0.5)$, (\ref{fig:random_dist_1.5}) $U(-1.5, 1.5)$, or (\ref{fig:random_dist_3}) $U(-3.0, 3.0)$.
  • Figure 3: To evaluate the quality of the ordering inferred by GraN-DAG, we sample one topological order at random among those admitted by the adjacency matrix before the CAM-pruning step. In this figure, we compare the empirical FNR-$\hat{\pi}$ of an order randomly sampled among those admitted by the output against the average FNR-$\hat{\pi}$ computed over the set of all orderings admitted by the output. We see that selecting an order at random gives an unbiased representation of the average order accuracy among those admitted by the GraN-DAG output before CAM-pruning. The violin plots refer to the FNR-$\hat{\pi}$ evaluated on ER graphs with $10$ nodes over $20$ different random seeds.
  • Figure 4: The violin plots show the difference between the F1 score of a method run with hyperparameters optimized using the ground truth and the F1 score of the same method run with default hyperparameter values. We denote this difference by $|\textnormal{f1}_\textnormal{diff}|$. For GES, the default is $\lambda=0.5$. For all remaining methods, the default alpha threshold is $\alpha=0.05$. The violin plots refer to the inference performance on datasets and graphs generated with $20$ different random seeds. Results are on data generated from the vanilla scenario, and we consider Erdős-Rényi graphs with the number of nodes in $\{5, 10, 20, 50\}$ in the dense and sparse settings.
  • Figure 5: F1 score and FNR-$\hat{\pi}$ on data generated with non-Gaussian noise terms (cf. Figure \ref{fig:random_dist_3}). For each method, we also display the violin plot of its performance on the vanilla scenario with Gaussian noise terms in a transparent color. F1 score (higher is better) and FNR-$\hat{\pi}$ (lower is better) are evaluated over $20$ seeds on Erdős-Rényi dense graphs with $20$ nodes (ER-20 dense). FNR-$\hat{\pi}$ is not computed for GES and PC, whose output is a CPDAG.
  • ...and 15 more figures
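
The two metrics named in the captions above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. FNR-$\hat{\pi}$ is read here as the fraction of ground-truth edges whose direction contradicts the inferred order, and F1 is computed on directed edges of binary adjacency matrices. The function names `order_fnr` and `f1_directed` are hypothetical.

```python
import numpy as np


def order_fnr(true_adj, order):
    """Fraction of true edges i -> j that the order places backwards.

    true_adj[i, j] == 1 encodes the edge i -> j; `order` lists all nodes
    from sources to leaves (assumed convention for this sketch).
    """
    order = np.asarray(order)
    position = np.empty(len(order), dtype=int)
    position[order] = np.arange(len(order))  # position[v] = rank of node v
    parents, children = np.nonzero(true_adj)
    if parents.size == 0:
        return 0.0  # no true edges, so nothing can be missed
    violated = position[parents] > position[children]
    return float(violated.mean())


def f1_directed(true_adj, pred_adj):
    """F1 score over directed edges of two binary adjacency matrices."""
    true_adj = np.asarray(true_adj, dtype=bool)
    pred_adj = np.asarray(pred_adj, dtype=bool)
    tp = np.sum(true_adj & pred_adj)
    fp = np.sum(~true_adj & pred_adj)
    fn = np.sum(true_adj & ~pred_adj)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 1.0


# Ground truth: the chain 0 -> 1 -> 2.
chain = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0]])
# A prediction with one correct edge (0 -> 1) and one spurious edge (0 -> 2).
pred = np.array([[0, 1, 1],
                 [0, 0, 0],
                 [0, 0, 0]])

fnr_correct = order_fnr(chain, [0, 1, 2])   # order agrees with every edge
fnr_swapped = order_fnr(chain, [1, 0, 2])   # edge 0 -> 1 is reversed
f1_example = f1_directed(chain, pred)
```

Under this reading, an order-based metric cannot be evaluated for methods such as GES and PC without extra choices, since a CPDAG leaves some edges undirected and therefore does not determine a single order.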

Theorems & Definitions (13)

  • Proposition 1
  • Definition 1
  • Proposition 2
  • Definition 2
  • Lemma 1: Lemma 1 of rolland22_score
  • Lemma 2: Lemma 1 of montagna23_nogam
  • Lemma 3: Lemma 1 of montagna23_das
  • proof
  • Lemma 4
  • proof
  • ...and 3 more