Score-based Greedy Search for Structure Identification of Partially Observed Linear Causal Models

Xinshuai Dong, Ignavier Ng, Haoyue Dai, Jiaqi Sun, Xiangchen Song, Peter Spirtes, Kun Zhang

TL;DR

The paper addresses the problem of identifying the full structure of partially observed linear causal models from observational data by developing a score-based greedy search framework. It introduces the Generalized N Factor Model (GNFM) and proves identifiability and global consistency: the likelihood score recovers the structure up to the Markov Equivalence Class (MEC). The Latent variable Greedy Equivalence Search (LGES) algorithm operationalizes this theory in two phases (latent-to-observed and latent-to-latent) and is shown to be asymptotically correct under GNFM, with strong empirical performance on synthetic and real datasets, robustness to misspecification, and practical runtime. The work provides a scalable, theoretically grounded approach for discovering latent and observed causal structure from covariances, with potential extensions to non-Gaussian and nonlinear settings.

Abstract

Identifying the structure of a partially observed causal system is essential to various scientific fields. Recent advances have focused on constraint-based causal discovery to solve this problem, yet in practice these methods often face challenges related to multiple testing and error propagation. A score-based method could mitigate these issues, which has drawn considerable attention to the question of whether a score-based greedy search method can handle the partially observed scenario. In this work, we propose the first score-based greedy search method for identifying structure involving latent variables with identifiability guarantees. Specifically, we propose the Generalized N Factor Model and establish global consistency: the true structure, including latent variables, can be identified up to the Markov equivalence class by using the score. We then design Latent variable Greedy Equivalence Search (LGES), a greedy search algorithm for this class of models with well-defined operators, which searches efficiently over the graph space to find the optimal structure. Our experiments on both synthetic and real-life data validate the effectiveness of our method (code will be publicly available).
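To make the idea of scoring a candidate structure from covariances concrete, the following is a minimal sketch of a zero-mean Gaussian log-likelihood evaluated on a sample covariance. It is not the paper's score or implementation: the function name `gaussian_loglik`, the one-latent toy model, and the loadings are all hypothetical choices for illustration, and no model-complexity (dimension) term is included.

```python
import numpy as np

def gaussian_loglik(S, Sigma, n):
    """Log-likelihood of n zero-mean Gaussian samples whose (1/n) sample
    covariance is S, evaluated under the model covariance Sigma."""
    p = S.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        raise ValueError("Sigma must be positive definite")
    trace_term = np.trace(np.linalg.solve(Sigma, S))  # tr(Sigma^{-1} S)
    return -0.5 * n * (p * np.log(2 * np.pi) + logdet + trace_term)

# Hypothetical toy data: one latent variable driving four observed variables.
rng = np.random.default_rng(0)
n = 2000
loadings = np.array([1.0, 0.8, 1.2, 0.9])  # assumed edge weights
latent = rng.normal(size=n)
data = np.outer(latent, loadings) + rng.normal(scale=0.5, size=(n, 4))
S = data.T @ data / n  # zero-mean maximum-likelihood sample covariance

# Candidate 1: covariance implied by the one-latent structure (true parameters).
Sigma_factor = np.outer(loadings, loadings) + 0.25 * np.eye(4)
# Candidate 2: covariance implied by an empty graph (independent variables).
Sigma_indep = np.diag(np.diag(S))

score_factor = gaussian_loglik(S, Sigma_factor, n)
score_indep = gaussian_loglik(S, Sigma_indep, n)
print(score_factor > score_indep)
```

A greedy search would compare such scores across neighboring candidate structures, with each candidate's parameters fitted rather than fixed as they are in this toy comparison.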

Paper Structure

This paper contains 37 sections, 13 theorems, 13 equations, 8 figures, 5 tables, 2 algorithms.

Key Result

Proposition 1

Consider the model defined in Definition 1, and let $F = \ldots$ and $\Omega = \ldots$, where $\Omega$ is the diagonal covariance matrix of $\epsilon_{\mathbf{V}_\mathcal{G}}$. Let $M=\left(I-F_{\mathbf{L}\mathbf{L}}-F_{\mathbf{L}\mathbf{X}}(I-F_{\mathbf{X}\mathbf{X}})^{-1} F_{\mathbf{X}\mathbf{L}}\right)^{-1}$ ...
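As a rough numerical illustration of the quantities named in this proposition, the sketch below builds a small linear model and computes the covariance it implies. It assumes the convention $\mathbf{V} = F\mathbf{V} + \epsilon$ with latent variables indexed first, which may differ from the paper's convention; the helper names, the causal order, and the edge weights are hypothetical. It also checks a standard block-inverse (Schur complement) fact: the matrix $M$ above equals the latent-latent block of $(I-F)^{-1}$.

```python
import numpy as np

def implied_covariance(F, Omega):
    """Covariance of V under V = F V + eps with Cov(eps) = Omega
    (assumed convention: F[i, j] is the weight of edge j -> i)."""
    A = np.linalg.inv(np.eye(F.shape[0]) - F)
    return A @ Omega @ A.T

def latent_block_M(F, n_latent):
    """M = (I - F_LL - F_LX (I - F_XX)^{-1} F_XL)^{-1}, latents indexed first."""
    n_obs = F.shape[0] - n_latent
    L = slice(0, n_latent)
    X = slice(n_latent, F.shape[0])
    inner = F[L, X] @ np.linalg.inv(np.eye(n_obs) - F[X, X]) @ F[X, L]
    return np.linalg.inv(np.eye(n_latent) - F[L, L] - inner)

rng = np.random.default_rng(0)
n_latent, n_obs = 2, 4
n = n_latent + n_obs

# Hypothetical causal order mixing latent (0, 1) and observed (2..5) variables,
# so that observed variables may also cause latent ones.
order = [2, 0, 3, 1, 4, 5]
F = np.zeros((n, n))
for pos, child in enumerate(order):
    for parent in order[:pos]:
        F[child, parent] = rng.uniform(0.5, 1.5)  # edge parent -> child

Omega = np.diag(rng.uniform(0.5, 1.0, size=n))  # diagonal noise covariance

Sigma = implied_covariance(F, Omega)
Sigma_X = Sigma[n_latent:, n_latent:]  # covariance of the observed variables

M = latent_block_M(F, n_latent)
total_effects = np.linalg.inv(np.eye(n) - F)
print(np.allclose(M, total_effects[:n_latent, :n_latent]))
```

Only `Sigma_X` is available from data in the partially observed setting, which is why a parameterization of the observed covariance in terms of the blocks of $F$ is useful.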

Figures (8)

  • Figure 1: Without further graphical assumptions, the algebraic equivalence class is very large and not very informative: if the ground truth is $\mathcal{G}^*$ in (a), then by Theorem 1 we may arrive at either $\hat{\mathcal{G}}_1$ (b) or $\hat{\mathcal{G}}_2$ (c), both of which are algebraically equivalent to $\mathcal{G}^*$.
  • Figure 2: An illustrative example of a graph that satisfies the generalized N factor model in Definition 2.
  • Figure 3: An illustration of the whole process of LGES, where (a) $\mathcal{S}_{\text{init}}$ is the initial state of Algorithm 1, (b) $\mathcal{S}_{\text{phase1}}$ is the output of Algorithm 1, and (c) $\mathcal{S}_{\text{final}}$ is the final output of Algorithm 2.
  • Figure 4: Causal structure (CPDAG) recovered by LGES on the Multi-tasking behavior dataset.
  • Figure 5: Causal structure (CPDAG) recovered by LGES on the Big Five personality dataset.
  • ...and 3 more figures

Theorems & Definitions (28)

  • Definition 1: Partially Observed Linear Causal Models
  • Proposition 1: Parameterization of Population Covariance (Dong et al., 2024)
  • Theorem 1: Algebraic Equivalence by Score and Dimension
  • Remark 1
  • Definition 2: Generalized N Factor Model
  • Theorem 2: Identifiability of Generalized N Factor Models by Equality Constraint up to MEC
  • Corollary 1: Global Consistency by Score for Generalized N Factor Models
  • Definition 3: Initial State for Generalized N Factor Model
  • Lemma 1: Properties of Initial State
  • Definition 4: Delete Operator $\mathcal{O}_{\mathbf{L}\mathbf{X}}$
  • ...and 18 more