Efficient Differentiable Discovery of Causal Order

Mathieu Chevalley, Arash Mehrjou, Patrick Schwab

TL;DR

This work reformulates Intersort using differentiable sorting and ranking techniques, enabling scalable and differentiable optimization of causal orderings and allowing the continuous score function to be incorporated as a regularizer in downstream tasks.

Abstract

In the algorithm Intersort, Chevalley et al. (2024) proposed a score-based method to discover the causal order of variables in a Directed Acyclic Graph (DAG) model, leveraging interventional data to outperform existing methods. However, as a score-based method over the permutahedron, Intersort is computationally expensive and non-differentiable, limiting its ability to be utilised in problems involving large-scale datasets, such as those in genomics and climate models, or to be integrated into end-to-end gradient-based learning frameworks. We address this limitation by reformulating Intersort using differentiable sorting and ranking techniques. Our approach enables scalable and differentiable optimization of causal orderings, allowing the continuous score function to be incorporated as a regularizer in downstream tasks. Empirical results demonstrate that causal discovery algorithms benefit significantly from regularizing on the causal order, underscoring the effectiveness of our method. Our work opens the door to efficiently incorporating regularization for causal order into the training of differentiable models and thereby addresses a long-standing limitation of purely associational supervised learning.

Paper Structure

This paper contains 11 sections, 1 theorem, 14 equations, 12 figures.

Key Result

Theorem 1

Let ${\mathbb{P}} = \mathop{\mathrm{arg\,max}}\limits_{{\bm{p}}} S({\bm{p}}, \epsilon, D, \mathcal{I}, P_X^{\mathcal{C}, (\emptyset)}, \mathcal{P}_{int}, c) \text{ s.t. } {\bm{p}}_i \neq {\bm{p}}_j \forall i, j \in \{1, \cdots, d\}$ be the set of potentials that maximize the score, such that no two

Figures (12)

  • Figure 1: Simulation and comparison between the bounds of Thm 2 and 4 of Chevalley et al. (2024) for Erdős-Rényi (ER, left) and scale-free (SF, right) networks with $2000$ variables. We compare the causal order obtained by maximizing our proposed DiffIntersort score with the output of SORTRANKING. We draw $1$ graph per setting, following either an ER distribution with edge probability $p_{e}$ in $\{ 0.0001, 0.00005, 0.00002 \}$ or a Barabási-Albert SF distribution with an average number of edges per variable in $\{1, 2, 3\}$. A setting is the tuple $(p_{int}, p_{e})$, where $p_{e} = \frac{2\mathrm{E}(\#edges)}{d(d-1)}$ for the SF distribution. For each graph, we run the algorithm on $1$ configuration, where a configuration corresponds to a draw of the targeted variables following $p_{int}$. We have $p_{int} \in \{0.25, 0.33, 0.5, 0.66, 0.75\}$. The settings are ordered on the x-axis by the effective intervention ratio $\frac{p_{int}}{\sqrt{p_e}}$ (Chevalley et al., 2024).
  • Figure 2: Top order divergence scores (lower is better) assessing the quality of the derived causal order, comparing our method based on the DiffIntersort score to SORTRANKING on $100$ variables, for various types of data.
  • Figure 3: Comparison of SHD (lower is better) for GRN, Linear, RFF, and Neural Network data with varying numbers for $30$ variables. Our method (CausalDisco with and without constraint) achieves lower SHD values compared to baseline methods on GRN and RFF data. GIES outperforms on the linear data and DCDI performs slightly better on NN data.
  • Figure 4: F1 score of our algorithm with DiffIntersort constraint for the four considered data types over the fraction of intervened variables for 10, 30 and 100 variables. As can be observed, the performance is consistent across the scale of the number of variables as there is no major drop in performance at $100$ variables compared to $10$ and $30$ variables.
  • Figure 5: Comparison of performance on simulated ER graphs in terms of $D_{top}$ divergence between the two bounds of Chevalley et al. (2024), DiffIntersort, Intersort and SORTRANKING. For each setting, we draw multiple graphs, where a setting is the tuple $(p_{int}, p_{e})$. Then, for each graph, we run the algorithm on multiple configurations, where a configuration corresponds to a set of intervened variables following $p_{int}$. We have $p_{int} \in \{0.25, 0.33, 0.5, 0.66, 0.75\}$ for all scales. For $5$ variables, we have $p_{e} \in \{0.5, 0.66, 0.75\}$. For $30$ variables, we have $p_{e} \in \{0.05, 0.1, 0.2\}$. For $1000$ variables, we have $p_{e} \in \{0.005, 0.002, 0.001\}$. For $20000$ variables, we have $p_{e} \in \{0.0001, 0.00005, 0.00002\}$. Those edge probabilities approximately correspond to an average of $1$, $2$ or $3$ edges per variable.
  • ...and 7 more figures
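The captions of Figures 1 and 5 order the settings $(p_{int}, p_{e})$ by the effective intervention ratio $\frac{p_{int}}{\sqrt{p_e}}$ and define $p_{e} = \frac{2\mathrm{E}(\#edges)}{d(d-1)}$ for scale-free graphs. The short sketch below shows how such an ordering could be computed from those two formulas. The function names are hypothetical, and taking the expected number of edges to be the average number of edges per variable times $d$ is an assumption of this example, not something the captions state.

```python
import math
from itertools import product


def edge_probability(expected_edges: float, d: int) -> float:
    """p_e = 2 E(#edges) / (d (d - 1)), as given in the Figure 1 caption."""
    return 2.0 * expected_edges / (d * (d - 1))


def effective_intervention_ratio(p_int: float, p_e: float) -> float:
    """Effective intervention ratio p_int / sqrt(p_e)."""
    return p_int / math.sqrt(p_e)


# Values listed in the Figure 1 caption.
P_INT = [0.25, 0.33, 0.5, 0.66, 0.75]
P_E = [0.0001, 0.00005, 0.00002]

# Every (p_int, p_e) setting, sorted as on the x-axis of the figure.
settings = sorted(
    product(P_INT, P_E),
    key=lambda s: effective_intervention_ratio(*s),
)

# Assumption: E(#edges) = m * d for an average of m edges per variable.
d = 2000
sf_edge_probabilities = [edge_probability(m * d, d) for m in (1, 2, 3)]
```

Under this ordering, settings with few interventions on dense graphs come first, and settings with many interventions on sparse graphs come last.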

Theorems & Definitions (5)

  • Definition 1
  • Definition 2
  • Definition 3 (Chevalley et al., 2024)
  • Theorem 1
  • Proof of Theorem 1