Standardizing Structural Causal Models

Weronika Ormaniec, Scott Sussex, Lars Lorch, Bernhard Schölkopf, Andreas Krause

TL;DR

The paper addresses artifacts in synthetic data from structural causal models (SCMs) used to benchmark structure learning: variances and pairwise correlations tend to increase along the causal ordering. It introduces internally-standardized SCMs (iSCMs), which standardize each variable during the generative process. By construction, iSCMs are not Var-sortable, and empirically they are mostly not R^2-sortable for commonly used graph families. On the theory side, the authors show that linear iSCMs do not collapse to deterministic relationships as graph depth grows, whereas post-hoc standardized SCMs can be partially identifiable from observational data under certain assumptions on the weights; in contrast, linear Gaussian iSCMs with forest DAGs are non-identifiable. Empirically, iSCMs remove exploitable covariance artifacts while still permitting nontrivial structure learning by standard algorithms, which suggests iSCMs as a robust benchmarking tool and a candidate modeling framework beyond benchmarking. The code is publicly available, and the authors position iSCMs as stable, scale-free, unit-consistent models with potential broader applicability in causal inference.

Abstract

Synthetic datasets generated by structural causal models (SCMs) are commonly used for benchmarking causal structure learning algorithms. However, the variances and pairwise correlations in SCM data tend to increase along the causal ordering. Several popular algorithms exploit these artifacts, possibly leading to conclusions that do not generalize to real-world settings. Existing metrics like $\operatorname{Var}$-sortability and $\operatorname{R^2}$-sortability quantify these patterns, but they do not provide tools to remedy them. To address this, we propose internally-standardized structural causal models (iSCMs), a modification of SCMs that introduces a standardization operation at each variable during the generative process. By construction, iSCMs are not $\operatorname{Var}$-sortable. We also find empirical evidence that they are mostly not $\operatorname{R^2}$-sortable for commonly-used graph families. Moreover, contrary to the post-hoc standardization of data generated by standard SCMs, we prove that linear iSCMs are less identifiable from prior knowledge on the weights and do not collapse to deterministic relationships in large systems, which may make iSCMs a useful model in causal inference beyond the benchmarking problem studied here. Our code is publicly available at: https://github.com/werkaaa/iscm.
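
The difference between post-hoc and internal standardization can be illustrated with a minimal sketch on a chain graph. This is not the authors' implementation (see the linked repository for that): the function names, the fixed weights, and the use of sample statistics in place of population statistics are all assumptions made for illustration.

```python
import random
import statistics


def standardize(values):
    """Z-standardize a sample: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def sample_chain(weights, n, internal, seed=0):
    """Sample n observations of a linear chain x_1 -> x_2 -> ... with N(0, 1) noise.

    internal=False: standard SCM, each variable is computed from its raw parent.
    internal=True:  iSCM-style sketch, each variable is standardized before it
                    is passed on to its child (sample statistics stand in for
                    population statistics; this is an assumption of the sketch).
    Returns a list of columns, one per variable.
    """
    rng = random.Random(seed)
    columns = []
    parent = None
    for w in [None] + list(weights):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if parent is None:
            x = noise  # root node: noise only
        else:
            x = [w * p + e for p, e in zip(parent, noise)]
        if internal:
            x = standardize(x)
        columns.append(x)
        parent = x
    return columns


# Hypothetical weights with magnitudes inside [0.5, 2.0].
weights = [2.0, -1.5, 2.0, 1.5]
scm = sample_chain(weights, n=5000, internal=False)
iscm = sample_chain(weights, n=5000, internal=True)

scm_vars = [statistics.pvariance(c) for c in scm]
iscm_vars = [statistics.pvariance(c) for c in iscm]
print("SCM variances: ", [round(v, 2) for v in scm_vars])
print("iSCM variances:", [round(v, 2) for v in iscm_vars])
```

With these weights the raw SCM variances grow along the chain, while every variable in the internally standardized version has unit variance, so variance carries no information about the causal order.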

Paper Structure

This paper contains 80 sections, 9 theorems, 62 equations, 21 figures, 5 tables, 2 algorithms.

Key Result

Lemma 0

Let $\mathbf{x}$ be modeled by a linear SCM defined by eq:linear_scm with DAG $\mathcal{G}$, such that $\operatorname{Var}[x_i] = 1$ for every variable. Then, the covariance $\operatorname{Cov}[x_i, x_j]$ is the sum of products of the weights along all unblocked paths between the nodes of $x_i$ and $x_j$ in $\mathcal{G}$: $$\operatorname{Cov}[x_i, x_j] = \sum_{p \in P_{j \leftrightarrow i}} \prod_{(k, l) \in p} w_{k,l},$$ where $P_{j \leftrightarrow i}$ are all unblocked paths from $x_j$ to $x_i$ in $\mathcal{G}$, and …
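
The prose statement of the lemma can be checked numerically on a small example. The sketch below assumes a hypothetical three-node DAG ($x_1 \to x_2$, $x_1 \to x_3$, $x_2 \to x_3$) with illustrative weights `a`, `b`, `c`, and picks the noise scales so that every variable has unit variance; none of these values come from the paper.

```python
import math
import random


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


# Hypothetical edge weights: a on x1 -> x2, b on x1 -> x3, c on x2 -> x3.
a, b, c = 0.6, 0.3, 0.5

# Noise scales chosen so that Var[x2] = Var[x3] = 1.
sigma2 = math.sqrt(1 - a**2)
sigma3 = math.sqrt(1 - (b**2 + c**2 + 2 * a * b * c))

rng = random.Random(0)
n = 50_000
x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
x2 = [a * u + sigma2 * rng.gauss(0.0, 1.0) for u in x1]
x3 = [b * u + c * v + sigma3 * rng.gauss(0.0, 1.0) for u, v in zip(x1, x2)]

# Sum over unblocked paths of the product of weights along each path.
path_sums = {
    "cov_12": a,          # x1 -> x2
    "cov_13": b + a * c,  # x1 -> x3 and x1 -> x2 -> x3
    "cov_23": c + a * b,  # x2 -> x3 and x2 <- x1 -> x3
}
empirical = {
    "cov_12": covariance(x1, x2),
    "cov_13": covariance(x1, x3),
    "cov_23": covariance(x2, x3),
}
for key in path_sums:
    print(key, "path sum:", round(path_sums[key], 3), "empirical:", round(empirical[key], 3))
```

The empirical covariances should agree with the path sums up to Monte Carlo error. Note that the fork $x_2 \leftarrow x_1 \rightarrow x_3$ counts as an unblocked path because it contains no collider.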

Figures (21)

  • Figure 1: Standardizing SCMs two ways. Generative process for a chain graph of (a) standard SCMs, with data $\mathbf{x}$ standardized post-hoc, and (b) SCMs with standardization performed during the generative process (iSCMs). Dashed arrows indicate z-standardization. Solid arrows indicate linear functions with weights from ${\operatorname{Unif}_{\pm}[0.5, 2.0]}$ and additive noise from ${\mathcal{N}(0, 1)}$. We report absolute correlations ${\lvert\rho\rvert}$ of two consecutive observed variables, (a) $x_j^s$ and $x_{j+1}^s$, or (b) ${\widetilde{x}_{j}}$ and ${\widetilde{x}_{j+1}}$, averaged over 100,000 models. In standard SCMs (a), correlations tend to increase along the causal ordering.
  • Figure 2: Causal mechanisms in iSCMs. The function $f_i$ modeling $x_i$ depends on the standardized ${\mathbf{\widetilde{x}}_{\mathrm{pa}(i)}}$. Dashing indicates z-standardization.
  • Figure 3: iSCMs with the same covariance matrix. (a) DAGs in an MEC with the same edge weights. (b) Covariance matrix for all linear iSCMs in (a) when $\alpha = 1$, $\beta = 2$.
  • Figure 4: $\operatorname{R^2}$-sortability for different graph sizes. Linear standardized SCMs and iSCMs with $\varepsilon_{i} \sim \mathcal{N}(0, 1)$ and weights drawn from uniform distributions with supports given above each plot. For every model, we evaluate 100 systems with $n = 1000$ samples each. Lines and shaded regions denote mean and standard deviation. Datasets that satisfy $\operatorname{R^2}$-sortability $=0.5$ (dashed) are not $\operatorname{R^2}$-sortable.
  • Figure 5: Structure learning performance on SCM and iSCM data. F1 scores for recovering the edges of the true graph. Box plots show median and interquartile range (IQR). Whiskers extend to the largest value inside 1.5$\times$IQR from the boxes. Left (right) column shows results for linear (nonlinear) causal mechanisms with additive noise $\varepsilon_{i} \sim \mathcal{N}(0, 1)$ and $w_{i,j} \sim \operatorname{Unif}_{\pm}[0.5, 2.0]$ (Appendix E). For every model, we evaluate 20 systems, each using $n = 1000$ data points.
  • ...and 16 more figures
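
The correlation pattern described in the Figure 1 caption can be reproduced in closed form for a simplified setting. The sketch below assumes a chain in which every edge has the same weight `w` (an illustrative simplification; the figure draws weights at random) and unit-variance additive noise. The function name and parameters are hypothetical.

```python
import math


def chain_correlations(w, depth, internal):
    """Population |correlation| between consecutive variables in a linear chain.

    Each child is w * parent + eps with eps ~ N(0, 1) and a unit-variance root.
    internal=False: standard SCM (post-hoc standardization leaves correlations
                    unchanged, so raw variances can be tracked directly).
    internal=True:  iSCM-style, the parent is rescaled to unit variance before
                    it enters the child's mechanism.
    """
    correlations = []
    parent_var = 1.0
    for _ in range(depth):
        child_var = w**2 * parent_var + 1.0
        correlations.append(abs(w) * math.sqrt(parent_var / child_var))
        parent_var = 1.0 if internal else child_var
    return correlations


scm_corr = chain_correlations(w=1.5, depth=5, internal=False)
iscm_corr = chain_correlations(w=1.5, depth=5, internal=True)
print("SCM: ", [round(r, 3) for r in scm_corr])
print("iSCM:", [round(r, 3) for r in iscm_corr])
```

Under these assumptions the SCM correlations climb toward 1 along the chain, while the internally standardized chain keeps the same correlation $\lvert w \rvert / \sqrt{w^2 + 1}$ at every edge.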

Theorems & Definitions (14)

  • Lemma 0: Covariance in linear SCMs with unit marginal variances
  • Theorem 1: Bound on $\operatorname{CEV_f}$ in linear iSCMs
  • Theorem 2: Partial identifiability of standardized linear SCMs with forest DAGs
  • Theorem 3: Nonidentifiability of linear Gaussian iSCMs with forest DAGs
  • Lemma 3: Covariance in linear SCMs with unit marginal variances
  • proof
  • Theorem 3: Bound on $\operatorname{CEV_f}$ in linear iSCMs
  • proof
  • Lemma 3: Orientation of edges in undirected chains of standardized SCMs
  • proof
  • ...and 4 more