Causal Discovery over High-Dimensional Structured Hypothesis Spaces with Causal Graph Partitioning
Ashka Shah, Adela DePavia, Nathaniel Hudson, Ian Foster, Rick Stevens
TL;DR
The authors address scalable causal discovery in high-dimensional spaces by introducing a causal partition, which uses a superstructure to partition the hypothesis space and enable divide-and-conquer learning. They prove that, under faithfulness and consistency of the local structure learners, merging local PAG outputs via a Screen procedure recovers the CPDAG $H^*$ corresponding to the Markov equivalence class (MEC) of the true graph $G^*$, so the method is consistent in the infinite-data limit. The method supports arbitrary partitions extended from a disjoint partition and requires no extra learning step to merge subsets, yielding substantial speedups on synthetic and gene-regulatory-network-like tasks with up to $10^4$ variables. Empirical results show favorable trade-offs between accuracy and runtime across random and biologically tuned networks, with particularly strong gains on large-scale problems and robustness to imperfect superstructures. The work thus enables reliable causal discovery in domains such as gene regulatory networks, where high dimensionality previously impeded scalable inference.
Abstract
The aim in many sciences is to understand the mechanisms that underlie the observed distribution of variables, starting from a set of initial hypotheses. Causal discovery allows us to infer mechanisms as sets of cause-and-effect relationships in a generalized way, without necessarily tailoring to a specific domain. Causal discovery algorithms search over a structured hypothesis space, defined by the set of directed acyclic graphs, to find the graph that best explains the data. For high-dimensional problems, however, this search becomes intractable, and scalable algorithms for causal discovery are needed to bridge the gap. In this paper, we define a novel causal graph partition that allows for divide-and-conquer causal discovery with theoretical guarantees. We leverage the idea of a superstructure (a set of learned or existing candidate hypotheses) to partition the search space. We prove that, under certain assumptions, learning with a causal graph partition always yields the Markov Equivalence Class of the true causal graph. We show that our algorithm achieves comparable accuracy and a faster time to solution on biologically tuned synthetic networks and on networks with up to $10^4$ variables. This makes our method applicable to gene regulatory network inference and other domains with high-dimensional structured hypothesis spaces.
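To make the idea of partitioning with a superstructure concrete, here is a minimal illustrative sketch in Python. It is not the authors' construction. It assumes the superstructure is an undirected graph stored as an adjacency dictionary, and it assumes one simple way to extend a disjoint partition into overlapping subsets: adding each part's one-hop superstructure neighbors. The names `expand_partition` and `edges_covered` are hypothetical. Local structure learning and the merge step are omitted.

```python
from itertools import chain


def expand_partition(superstructure, disjoint_parts):
    """Extend each disjoint part with its one-hop superstructure neighbors.

    Assumption for illustration only: the paper's causal partition may use
    a different expansion rule.
    """
    expanded = []
    for part in disjoint_parts:
        part = set(part)
        # Collect every superstructure neighbor of every node in the part.
        boundary = set(chain.from_iterable(superstructure[v] for v in part))
        expanded.append(part | boundary)
    return expanded


def edges_covered(superstructure, subsets):
    """Return True if every superstructure edge lies inside some subset."""
    for u, neighbors in superstructure.items():
        for v in neighbors:
            if not any(u in s and v in s for s in subsets):
                return False
    return True


# Toy superstructure: a path a - b - c - d - e (candidate adjacencies).
superstructure = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

disjoint_parts = [{"a", "b", "c"}, {"d", "e"}]
overlapping = expand_partition(superstructure, disjoint_parts)

# The disjoint parts miss the candidate edge c - d; the expanded ones do not.
print(edges_covered(superstructure, disjoint_parts))
print(edges_covered(superstructure, overlapping))
```

In this toy case, the candidate edge between `c` and `d` straddles the two disjoint parts, so no local learner restricted to one part could examine it. After expansion, both endpoints appear together in a subset, which is the kind of overlap a divide-and-conquer scheme needs before local results can be combined.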
