Coordinated Multi-Neighborhood Learning on a Directed Acyclic Graph

Stephen Smith, Qing Zhou

TL;DR

This work tackles the challenge of causal discovery in high-dimensional settings by focusing on local structure around user-specified target nodes rather than learning the full DAG. It introduces Coordinated Multi-Neighborhood Learning (CML), a two-stage constraint-based framework that builds a maximal ancestral graph over the union of target neighborhoods NB_T and then orients edges jointly across all neighborhoods using a coordinated application of FCI rules. The authors prove population-level and Gaussian-consistency results for the local MAG/PAG learned by CML and demonstrate substantial gains in accuracy and computational efficiency over global methods (like PC) and non-coordinated local methods (SNL) in synthetic experiments, as well as competitive performance on real gene regulatory data. The findings suggest that coordinated local structure learning can yield more precise causal inferences with far lower computational cost, enabling scalable causal discovery focused on scientifically relevant subsets of variables.

Abstract

Learning the structure of causal directed acyclic graphs (DAGs) is useful in many areas of machine learning and artificial intelligence, with wide applications. However, in the high-dimensional setting, it is challenging to obtain good empirical and theoretical results without strong and often restrictive assumptions. Additionally, it is questionable whether all of the variables purported to be included in the network are observable. It is of interest then to restrict consideration to a subset of the variables for relevant and reliable inferences. In fact, researchers in various disciplines can usually select a set of target nodes in the network for causal discovery. This paper develops a new constraint-based method for estimating the local structure around multiple user-specified target nodes, enabling coordination in structure learning between neighborhoods. Our method facilitates causal discovery without learning the entire DAG structure. We establish consistency results for our algorithm with respect to the local neighborhood structure of the target nodes in the true graph. Experimental results on synthetic and real-world data show that our algorithm is more accurate in learning the neighborhood structures with much less computational cost than standard methods that estimate the entire DAG. An R package implementing our methods may be accessed at https://github.com/stephenvsmith/CML.
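As a minimal illustration of what "local structure around multiple user-specified target nodes" can mean, the sketch below computes the Markov blanket (parents, children, and spouses) of each target in a known toy DAG and takes the union over targets. The graph, the node names, and the helper functions are hypothetical, and treating the neighborhood as "target plus Markov blanket" is an assumption made for illustration. It is not the CML algorithm, which has to estimate these sets from data through conditional independence tests.

```python
from typing import Dict, Iterable, Set

# Toy DAG as a parent map: node -> set of parents.
# Hypothetical example graph, not taken from the paper.
PARENTS: Dict[str, Set[str]] = {
    "X1": set(),
    "X2": {"X1"},
    "X3": {"X2"},
    "X4": {"X3", "X5"},
    "X5": set(),
    "X6": {"X5"},
    "X7": {"X6"},
    "X8": {"X7"},
    "X9": {"X8"},
}


def children(parents: Dict[str, Set[str]], node: str) -> Set[str]:
    """Return every node that lists `node` among its parents."""
    return {v for v, ps in parents.items() if node in ps}


def markov_blanket(parents: Dict[str, Set[str]], node: str) -> Set[str]:
    """Parents, children, and spouses (other parents of children) of `node`."""
    ch = children(parents, node)
    spouses: Set[str] = set()
    for c in ch:
        spouses |= parents[c]
    return (parents[node] | ch | spouses) - {node}


def neighborhood_union(
    parents: Dict[str, Set[str]], targets: Iterable[str]
) -> Set[str]:
    """Union of the targets and their Markov blankets (assumed definition)."""
    targets = list(targets)
    nodes: Set[str] = set(targets)
    for t in targets:
        nodes |= markov_blanket(parents, t)
    return nodes


if __name__ == "__main__":
    print(sorted(markov_blanket(PARENTS, "X3")))
    print(sorted(neighborhood_union(PARENTS, ["X3", "X8"])))
```

With targets `X3` and `X8`, only seven of the nine nodes enter the union, which is the kind of reduction that motivates learning locally instead of estimating the entire DAG.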

Paper Structure

This paper contains 20 sections, 5 theorems, 7 equations, 6 figures, 2 tables, and 1 algorithm.

Key Result

Lemma 1

Under Assumption asp:inp, the graph $\mathcal{G}^*_N$ defined by the above procedure is a maximal ancestral graph (MAG).

Figures (6)

  • Figure 1: An illustration of the CML algorithm. (a) The neighborhoods of two target nodes. The highlighted nodes $\{X_3,X_8\}$ are the specified target nodes, the gray nodes belong to the Markov blanket (Mb) of at least one target node, and the white nodes are second-order neighbors. (b) The graph produced after the first phase of skeleton recovery. Edges in red are between-neighborhood edges and edges in black are within-neighborhood edges. The edge in blue will be removed during the second phase of skeleton recovery. (c) Output of the CML algorithm. (d) Output of the Single Neighborhood Learning (SNL) algorithm.
  • Figure 2: Accuracy comparison between the global and local algorithms: the distributions of F1 scores for different combinations of network size category and skeleton CI test significance level.
  • Figure 3: Comparisons between the global and local algorithms with respect to complexity. (a) The distributions of the number of CI tests used by each algorithm for different network sizes on a log scale. (b) The distributions of runtime for different combinations of network size category and skeleton CI test significance levels on a log scale.
  • Figure 4: The distributions of the parent recovery accuracy F1 scores for different network size and significance level combinations. (a) The loose F1 score; (b) The strict F1 score.
  • Figure 5: The distributions of test data log-likelihood scores for different algorithm and parent-identification strategy combinations. Scores are adjusted by normalizing according to the number of cells and the number of nodes in the target set for easier comparison.
  • ...and 1 more figure
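Several of the captions above report F1 scores. As a rough guide to how such a score can be computed for edge recovery, the sketch below evaluates the standard F1 (harmonic mean of precision and recall) on undirected edge sets. The function name and the example edges are hypothetical, and the paper's exact definitions, including the "loose" and "strict" variants in Figure 4, are not given in this summary, so this should be read as the generic metric only.

```python
from typing import Iterable, Tuple

Edge = Tuple[str, str]


def skeleton_f1(true_edges: Iterable[Edge], est_edges: Iterable[Edge]) -> float:
    """Standard F1 over undirected edges; edge direction is ignored."""
    # frozenset makes (a, b) and (b, a) compare equal.
    true_set = {frozenset(e) for e in true_edges}
    est_set = {frozenset(e) for e in est_edges}
    tp = len(true_set & est_set)  # correctly recovered edges
    if tp == 0:
        return 0.0
    precision = tp / len(est_set)
    recall = tp / len(true_set)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical true and estimated skeletons around one target node.
    truth = [("X2", "X3"), ("X3", "X4"), ("X5", "X4")]
    estimate = [("X3", "X2"), ("X3", "X4"), ("X4", "X6")]
    print(skeleton_f1(truth, estimate))
```

In the example, two of three estimated edges are correct and two of three true edges are recovered, so precision and recall are both 2/3.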

Theorems & Definitions (9)

  • Remark 1
  • Remark 2
  • Lemma 1
  • Theorem 2
  • Theorem 3
  • Corollary 4
  • Theorem 5
  • Proof of Lemma 1
  • Proof of Theorem \ref{thm:pop}