Boosting Causal Additive Models

Maximilian Kertel, Nadja Klein

TL;DR

The paper tackles causal discovery under additive noise models by learning a CAM-based DAG through boosting. It introduces a regression-based score over topological orderings and proves that $L^2$-boosting with early stopping yields a consistent ordering, even under misspecification. A high-dimensional extension uses component-wise boosting with an AIC-driven stopping rule and pruning to scale to large graphs; this variant remains competitive with state-of-the-art methods in simulations. Overall, the work provides a principled, tunable framework that links regression-consistency, variance estimation, and graph identifiability to robust causal-order discovery in both low- and high-dimensional regimes.

Abstract

We present a boosting-based method to learn additive Structural Equation Models (SEMs) from observational data, with a focus on the theoretical aspects of determining the causal order among variables. We introduce a family of score functions based on arbitrary regression techniques, for which we establish necessary conditions to consistently favor the true causal ordering. Our analysis reveals that boosting with early stopping meets these criteria and thus offers a consistent score function for causal orderings. To address the challenges posed by high-dimensional data sets, we adapt our approach through a component-wise gradient descent in the space of additive SEMs. Our simulation study underlines our theoretical results for lower dimensions and demonstrates that our high-dimensional adaptation is competitive with state-of-the-art methods. In addition, it exhibits robustness with respect to the choice of the hyperparameters making the procedure easy to tune.
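The component-wise boosting idea mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the cubic polynomial base learners, the step size `step_size`, and the fixed step count `n_steps` (playing the role of early stopping) are all assumptions chosen for brevity. In each step, every covariate is fitted to the current residuals and only the best-fitting one is updated, with shrinkage.

```python
import numpy as np

def componentwise_l2_boost(X, y, n_steps=10, step_size=0.3, degree=3):
    """Component-wise L2-boosting sketch (hypothetical polynomial base learners).

    Returns the fitted values and the list of covariate indices chosen per step.
    A small n_steps acts as early stopping.
    """
    n, p = X.shape
    fitted = np.full(n, y.mean())          # start from the constant fit
    path = []
    for _ in range(n_steps):
        resid = y - fitted                 # negative gradient of the squared loss
        best_j, best_rss, best_update = None, np.inf, None
        for j in range(p):
            # Fit a weak learner of the residuals on covariate j alone.
            coef = np.polyfit(X[:, j], resid, degree)
            update = np.polyval(coef, X[:, j])
            rss = np.sum((resid - update) ** 2)
            if rss < best_rss:
                best_j, best_rss, best_update = j, rss, update
        fitted = fitted + step_size * best_update   # shrunken update of one component
        path.append(best_j)
    return fitted, path

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 3))
y = np.sin(2 * X[:, 0]) + 0.3 * rng.normal(size=n)   # only covariate 0 is a parent

fitted, path = componentwise_l2_boost(X, y)
print("selected covariates per step:", path)
print("residual variance:", np.mean((y - fitted) ** 2))
```

Running more steps keeps lowering the training residual variance but eventually starts fitting noise covariates, which is why the number of steps has to be controlled.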

Paper Structure

This paper contains 34 sections, 20 theorems, 143 equations, 2 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

Let Assumption ass:sem_type hold. If the regression estimator is such that [...], then the derived score function $\widehat{S}$ satisfies $\widehat{S}(\pi^0) < \widehat{S}(\pi)$ for any $\pi^0 \in \Pi(G^0)$ and $\pi \notin \Pi(G^0)$, with probability going to $1$ as $N \rightarrow \infty$.
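Reading $\Pi(G)$ as the set of topological orderings of a graph $G$, the sketch below enumerates that set by brute force for the three-node example of Figure 1. The edge set $2 \rightarrow 1$, $2 \rightarrow 3$ is an assumption inferred from the figure caption, and the function name `topological_orderings` is illustrative only.

```python
from itertools import permutations

def topological_orderings(nodes, edges):
    """Return all permutations of `nodes` in which every parent precedes its child."""
    result = []
    for perm in permutations(nodes):
        position = {node: i for i, node in enumerate(perm)}
        if all(position[parent] < position[child] for parent, child in edges):
            result.append(perm)
    return result

# Assumed edges (parent, child), inferred from the Figure 1 caption.
edges = [(2, 1), (2, 3)]
orderings = topological_orderings([1, 2, 3], edges)
print(orderings)  # [(2, 1, 3), (2, 3, 1)]
```

Brute-force enumeration grows factorially in $p$, so it only serves to make the definition concrete for very small graphs.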

Figures (2)

  • Figure 1: An example of a SEM on the left-hand side and its corresponding graph $G$ on the right-hand side for $p=3$. The set of possible topological orderings is $\{(2,1,3), (2,3,1)\}$. For $\pi^0 = (2,3,1)$ it holds $X_1 = f_1(X_2) + \varepsilon_1 = f_{12}(X_2) + f_{13}(X_3) + \varepsilon_1$ with $f_{12} = f_1$ and $f_{13} = 0$.
  • Figure 2: The blue dots represent $500$ realizations of a distribution following a SEM with $p=2$ and $X_1 = \varepsilon_1 \sim \mathcal{N}(0, 1)$ and $X_2 = -3\cos(X_1) + \varepsilon_2$ with $\varepsilon_2 \sim \mathcal{N}(0, 1)$. On the left-hand side we plot $X_2$ on the $y$-axis and $X_1$ on the $x$-axis, while on the right-hand side it is vice versa. The red lines give the conditional mean functions. We see that $\mathop{\mathrm{arg\,min}}\limits_{(f_1, f_2) \in \vartheta((1,2))} \sum_{k=1}^2\log(\sigma^2_{k, p_{\theta^0}, f_k, (1,2)}) = (0, -3\cos(x_1))$ and $\mathop{\mathrm{arg\,min}}\limits_{(f_1, f_2) \in \vartheta((2,1))} \sum_{k=1}^2\log(\sigma^2_{k, p_{\theta^0}, f_k, (2,1)}) = (0, 0)$. The distribution of $X_1 - \mathbf{E}\left[X_1|X_2 = x_2\right]$ becomes bimodal for larger values of $x_2$. The unexplained noise (distance of the blue dots to the red line) is smaller on the left, which is the correct ordering, thus $S((1,2)) < S((2,1))$.
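The comparison in Figure 2 can be reproduced numerically with a rough sketch. It simulates the two-variable SEM from the caption and evaluates a score of the form $\sum_k \log(\widehat{\sigma}^2_k)$ for both orderings. Plain polynomial least squares is swapped in here as the regression estimator instead of boosting, and the helper names, the polynomial degree, and the seed are assumptions made for illustration.

```python
import numpy as np

def residual_variance(target, predictor=None, degree=4):
    """Empirical residual variance of `target` after regressing on `predictor`."""
    if predictor is None:
        return np.var(target)              # first node in the ordering: no predecessors
    coef = np.polyfit(predictor, target, degree)
    return np.mean((target - np.polyval(coef, predictor)) ** 2)

def ordering_score(data, ordering):
    """Sum of log residual variances along a two-variable ordering."""
    first, second = ordering
    return (np.log(residual_variance(data[first]))
            + np.log(residual_variance(data[second], data[first])))

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)                    # X_1 = eps_1
x2 = -3 * np.cos(x1) + rng.normal(size=n)  # X_2 = -3 cos(X_1) + eps_2
data = {1: x1, 2: x2}

score_12 = ordering_score(data, (1, 2))    # correct ordering
score_21 = ordering_score(data, (2, 1))    # reversed ordering
print(score_12, score_21)
```

In the population, the correct ordering gives $\log 1 + \log 1 = 0$, whereas the reversed ordering leaves the full variance of $X_2$ unexplained, since $\mathbf{E}[X_1 | X_2] = 0$ by symmetry. The estimated scores should therefore differ by a clear margin.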

Theorems & Definitions (48)

  • Remark 1
  • Remark 2
  • Definition 1: Non-overfitting
  • Proposition 1
  • Definition 2
  • Example 1: Kernel functions
  • Remark 3
  • Definition 3
  • Remark 4
  • Theorem 5
  • ...and 38 more