Boosting Causal Additive Models

Maximilian Kertel; Nadja Klein

TL;DR

The paper tackles causal discovery under additive noise models by learning a CAM-based DAG through boosting. It introduces a regression-based score over topological orderings and proves that $L^2$-boosting with early stopping yields a consistent ordering, even under misspecification. A high-dimensional extension uses component-wise boosting with an AIC-driven stopping rule and pruning to scale to large graphs; this variant remains competitive with state-of-the-art methods in simulations. Overall, the work provides a principled, tunable framework that links regression-consistency, variance estimation, and graph identifiability to robust causal-order discovery in both low- and high-dimensional regimes.

Abstract

We present a boosting-based method to learn additive Structural Equation Models (SEMs) from observational data, with a focus on the theoretical aspects of determining the causal order among variables. We introduce a family of score functions based on arbitrary regression techniques, for which we establish necessary conditions to consistently favor the true causal ordering. Our analysis reveals that boosting with early stopping meets these criteria and thus offers a consistent score function for causal orderings. To address the challenges posed by high-dimensional data sets, we adapt our approach through a component-wise gradient descent in the space of additive SEMs. Our simulation study underlines our theoretical results for lower dimensions and demonstrates that our high-dimensional adaptation is competitive with state-of-the-art methods. In addition, it is robust to the choice of hyperparameters, which makes the procedure easy to tune.
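
The abstract refers to boosting with early stopping and to a component-wise variant. As a rough illustration only, the following Python sketch shows generic component-wise $L^2$-boosting for a single regression problem; it is not the authors' implementation. The polynomial base learners, the step size `step_size`, the fixed stopping iteration `n_steps`, and the toy data are all assumptions made for this example.

```python
import numpy as np

def componentwise_l2_boost(X, y, n_steps=50, step_size=0.3, degree=3):
    """Component-wise L2-boosting sketch.

    At each step, every single-covariate polynomial base learner is fitted
    to the current residuals, and only the best-fitting one is added
    (shrunken by step_size). Stopping after n_steps is the early stopping.
    """
    n, p = X.shape
    # One polynomial design matrix per covariate (assumed base learner).
    designs = [np.vander(X[:, j], degree + 1) for j in range(p)]
    fitted = np.full(n, y.mean())
    selected = []
    for _ in range(n_steps):
        residuals = y - fitted
        best_j, best_update, best_rss = None, None, np.inf
        for j, D in enumerate(designs):
            coef, *_ = np.linalg.lstsq(D, residuals, rcond=None)
            update = D @ coef
            rss = float(np.sum((residuals - update) ** 2))
            if rss < best_rss:
                best_j, best_update, best_rss = j, update, rss
        fitted = fitted + step_size * best_update
        selected.append(best_j)
    return fitted, selected

# Toy data (assumed): only the first covariate influences y.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

fit_few, sel_few = componentwise_l2_boost(X, y, n_steps=2)
fit_many, sel_many = componentwise_l2_boost(X, y, n_steps=50)
mse_few = float(np.mean((y - fit_few) ** 2))
mse_many = float(np.mean((y - fit_many) ** 2))
print(mse_few, mse_many, sel_many[:5])
```

The training error can only decrease as `n_steps` grows, which is why the number of steps acts as the regularization parameter. The fixed value used here is a placeholder for whatever stopping rule is actually applied.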

Paper Structure (34 sections, 20 theorems, 143 equations, 2 figures, 5 tables, 1 algorithm)

Introduction
Causal Discovery
Identifiability
Estimation of the Ordering
Sketch of the proof of Proposition \ref{prop:consistency_empirical_score}
Background and Preliminaries
Boosting
Reproducing Kernel Hilbert Spaces
Boosting DAGs
Assumptions
Main Theorem
Boosting under Misspecification
Consistency of Variance Estimation
Lower bound:
Upper bound:
...and 19 more sections

Key Result

Proposition 1

Let Assumption \ref{ass:sem_type} hold. If the regression estimator is such that […], then the derived score function $\widehat{S}$ satisfies […] for any $\pi^0 \in \Pi(G^0)$ and $\pi \notin \Pi(G^0)$, with probability going to $1$ as $N \rightarrow \infty$.

Figures (2)

Figure 1: An example of a SEM on the left-hand side and its corresponding graph $G$ on the right-hand side for $p=3$. The set of possible topological orderings is $\{(2,1,3), (2,3,1)\}$. For $\pi^0 = (2,3,1)$ it holds that $X_1 = f_1(X_2) + \varepsilon_1 = f_{12}(X_2) + f_{13}(X_3) + \varepsilon_1$ with $f_{12} = f_1$ and $f_{13} = 0$.
Figure 2: The blue dots represent $500$ realizations of a distribution following a SEM with $p=2$, $X_1 = \varepsilon_1 \sim \mathcal{N}(0, 1)$, and $X_2 = -3\cos(X_1) + \varepsilon_2$ with $\varepsilon_2 \sim \mathcal{N}(0, 1)$. On the left-hand side we plot $X_2$ on the $y$-axis and $X_1$ on the $x$-axis; on the right-hand side the axes are swapped. The red lines give the conditional mean functions. We see that $\mathop{\mathrm{arg\,min}}\limits_{(f_1, f_2) \in \vartheta((1,2))} \sum_{k=1}^2\log(\sigma^2_{k, p_{\theta^0}, f_k, (1,2)}) = (0, -3\cos(x_1))$ and $\mathop{\mathrm{arg\,min}}\limits_{(f_1, f_2) \in \vartheta((2,1))} \sum_{k=1}^2\log(\sigma^2_{k, p_{\theta^0}, f_k, (2,1)}) = (0, 0)$. The distribution of $X_1 - \mathbf{E}\left[X_1|X_2 = x_2\right]$ becomes bi-modal for larger values of $x_2$. The unexplained noise (distance of the blue dots to the red line) is smaller on the left, which corresponds to the correct ordering, and thus $S((1,2)) < S((2,1))$.
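
The comparison described in the caption of Figure 2 can be checked numerically. The sketch below simulates the SEM from the caption and evaluates a score of the form "sum of log residual variances" for both orderings. The polynomial least-squares fit stands in for the regression estimator and is an assumption of this example, as are the helper names `residual_variance` and `ordering_score` and the chosen seed.

```python
import numpy as np
from numpy.polynomial import Polynomial

def residual_variance(y, x=None, degree=5):
    """Variance of y after subtracting a fitted mean function of x.

    With x=None the variable has no predecessor, so only its mean is removed.
    The polynomial fit is an assumed stand-in for the regression estimator.
    """
    if x is None:
        return float(np.var(y))
    fit = Polynomial.fit(x, y, degree)
    return float(np.var(y - fit(x)))

def ordering_score(data, ordering, degree=5):
    """Sum of log residual variances for an ordering of two variables."""
    first, second = ordering
    return (np.log(residual_variance(data[first]))
            + np.log(residual_variance(data[second], data[first], degree)))

# Simulate the SEM from the caption: X1 ~ N(0, 1), X2 = -3 cos(X1) + eps2.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = -3.0 * np.cos(x1) + rng.normal(size=n)
data = {1: x1, 2: x2}

score_12 = ordering_score(data, (1, 2))  # correct ordering
score_21 = ordering_score(data, (2, 1))  # reversed ordering
print(score_12, score_21)
```

Because $X_2$ carries almost no information about the mean of $X_1$, regressing $X_1$ on $X_2$ barely reduces the variance, so the reversed ordering should receive the larger score, in line with $S((1,2)) < S((2,1))$.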

Theorems & Definitions (48)

Remark 1
Remark 2
Definition 1: Non-overfitting
Proposition 1
Definition 2
Example 1: Kernel functions
Remark 3
Definition 3
Remark 4
Theorem 5
...and 38 more
