Causal Discovery over High-Dimensional Structured Hypothesis Spaces with Causal Graph Partitioning

Ashka Shah, Adela DePavia, Nathaniel Hudson, Ian Foster, Rick Stevens

TL;DR

The authors address the challenge of scalable causal discovery in high-dimensional spaces by introducing a causal partition that uses a superstructure to split the hypothesis space and enable divide-and-conquer learning. They prove that, under faithfulness and consistency of the local structure learners, merging local PAG outputs via a Screen procedure recovers the CPDAG $H^*$ corresponding to the MEC of the true graph $G^*$, thereby achieving consistency in the infinite-data limit. The proposed method supports arbitrary partitions extended from a disjoint partition and requires no extra learning step to merge subsets, yielding substantial speedups on synthetic and gene-regulatory-network-like tasks with up to $10^4$ variables. Empirical results show favorable trade-offs between accuracy and runtime across random and biologically-tuned networks, with particularly strong gains in large-scale problems and robustness to imperfect superstructures. The work thus enables reliable causal discovery in domains such as gene regulatory networks, where high dimensionality previously impeded scalable inference.

Abstract

The aim in many sciences is to understand the mechanisms that underlie the observed distribution of variables, starting from a set of initial hypotheses. Causal discovery allows us to infer mechanisms as sets of cause-and-effect relationships in a generalized way -- without necessarily tailoring to a specific domain. Causal discovery algorithms search over a structured hypothesis space, defined by the set of directed acyclic graphs, to find the graph that best explains the data. For high-dimensional problems, however, this search becomes intractable, and scalable algorithms for causal discovery are needed to bridge the gap. In this paper, we define a novel causal graph partition that allows for divide-and-conquer causal discovery with theoretical guarantees. We leverage the idea of a superstructure -- a set of learned or existing candidate hypotheses -- to partition the search space. We prove under certain assumptions that learning with a causal graph partition always yields the Markov Equivalence Class of the true causal graph. We show our algorithm achieves comparable accuracy and a faster time to solution for biologically-tuned synthetic networks and networks with up to $10^4$ variables. This makes our method applicable to gene regulatory network inference and other domains with high-dimensional structured hypothesis spaces.
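
To make the idea of partitioning with a superstructure concrete, here is a minimal Python sketch of one way to turn a disjoint partition into an overlapping one. It rests on an assumption not stated in this summary: that each subset is extended with its neighbors in the (undirected) superstructure. The function names `build_adjacency` and `expand_partition` and the toy chain graph are hypothetical, and this is an illustration rather than the authors' implementation.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def expand_partition(superstructure_edges, disjoint_partition):
    """Extend each block with its outer boundary in the superstructure.

    ASSUMPTION: the expansion rule (block plus its superstructure
    neighbors) is illustrative; the paper's exact definition may differ.
    """
    adj = build_adjacency(superstructure_edges)
    expanded = []
    for block in disjoint_partition:
        block = set(block)
        boundary = set()
        for node in block:
            # Neighbors of the block that lie outside the block.
            boundary |= adj[node] - block
        expanded.append(block | boundary)
    return expanded


# Toy superstructure: a chain 1 - 2 - 3 - 4 - 5.
edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
parts = expand_partition(edges, [{1, 2, 3}, {4, 5}])
print(parts)
```

In this toy chain the two expanded subsets overlap on nodes 3 and 4, so a local learner run on either subset sees the edge that crosses the original cut.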

Paper Structure (35 sections, 11 theorems, 12 equations, 11 figures, 4 tables, 5 algorithms)

Key Result

Lemma 1

Given $\mathscr{A}$ satisfying the consistent PAG learner assumption, ...

Figures (11)

  • Figure 1: Examples of latent MAGs $L^{\text{MAG}}(G^*, S)$. Inducing paths $\Pi$ relative to $V\setminus S$ are highlighted in green. (a) For $x_1, x_2\in S$, any edge $(x_1,x_2)$ in $G^*$ is an inducing path relative to $V\setminus S$ between $x_1$ and $x_2$. (b) $\Pi$ is an inducing path relative to $V\setminus S$ between $x_1$ and $x_5$ because all non-endpoint nodes on the path are in $V\setminus S$. (c) $\Pi$ is an inducing path relative to $V\setminus S$ between $x_1$ and $x_5$ because every non-endpoint node is either in $V\setminus S$ (nodes $x_2, x_4$), or is in $S$, is a collider on the path, and is an ancestor of at least one of $x_1$ or $x_5$ (node $x_3$).
  • Figure 2: Expansive causal partition $\{S'_1,S'_2\}$ made from initial disjoint partition $\{S_1,S_2\}$.
  • Figure 3: Experiment increasing the number of samples $\bm{n}$. Error bars are 95% confidence intervals.
  • Figure 4: Experiment increasing fraction of extraneous edges in a perfect superstructure.
  • Figure 5: Increase in density of the imperfect superstructure by increasing the significance level $\bm{\alpha}$ of the PC algorithm.
  • ...and 6 more figures
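
The inducing-path conditions described in the Figure 1 caption can be checked mechanically. The sketch below is a minimal Python illustration using the standard formulation: every non-endpoint node outside the latent set $V\setminus S$ must be a collider on the path, and every collider must be an ancestor of an endpoint. The function names and the example DAG are hypothetical; the edges are chosen to mimic case (c) and are not taken from the paper's figure.

```python
def descendants(children, node):
    """Return all nodes reachable from `node` by directed edges."""
    seen, stack = set(), [node]
    while stack:
        cur = stack.pop()
        for child in children.get(cur, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def is_inducing_path(children, path, latent):
    """Check whether `path` is an inducing path relative to `latent`.

    `children` maps each node to its children in a DAG; `latent`
    plays the role of V \\ S in the caption.
    """
    edges = {(u, v) for u, kids in children.items() for v in kids}
    # Consecutive nodes must be adjacent in the DAG (either direction).
    for a, b in zip(path, path[1:]):
        if (a, b) not in edges and (b, a) not in edges:
            return False
    endpoints = {path[0], path[-1]}
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, node) in edges and (nxt, node) in edges
        # Observed non-endpoint nodes must be colliders on the path.
        if node not in latent and not collider:
            return False
        # Every collider must be an ancestor of some endpoint.
        if collider and not (endpoints & descendants(children, node)):
            return False
    return True


# Hypothetical DAG in the spirit of case (c): x2 and x4 are latent,
# x3 is an observed collider with a directed edge to x5.
dag = {"x2": ["x1", "x3"], "x4": ["x3", "x5"], "x3": ["x5"]}
path = ["x1", "x2", "x3", "x4", "x5"]
result = is_inducing_path(dag, path, latent={"x2", "x4"})
print(result)
```

Dropping the edge from `x3` to `x5` in this example makes the check fail, because the collider `x3` is then no longer an ancestor of either endpoint.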

Theorems & Definitions (25)

  • Definition 3.1: mixed graph, MAG
  • Definition 3.2: Inducing path
  • Definition 3.3: partial mixed graph, PAG
  • Definition 3.4: Latent MAG
  • Lemma 1
  • Definition 3.5: Causal Partition
  • Theorem 1
  • Definition 5.1
  • Lemma 2
  • Definition A.1: Collider on a path
  • ...and 15 more