Markov Equivalence and Consistency in Differentiable Structure Learning

Chang Deng, Kevin Bello, Pradeep Ravikumar, Bryon Aragam

TL;DR

By carefully regularizing the likelihood, it is possible to identify the sparsest model in the Markov equivalence class, even in the absence of an identifiable parametrization, thus paving the way for differentiable structure learning under general models and losses.

Abstract

Existing approaches to differentiable structure learning of directed acyclic graphs (DAGs) rely on strong identifiability assumptions in order to guarantee that global minimizers of the acyclicity-constrained optimization problem identify the true DAG. Moreover, it has been observed empirically that the optimizer may exploit undesirable artifacts in the loss function. We explain and remedy these issues by studying the behavior of differentiable acyclicity-constrained programs under general likelihoods with multiple global minimizers. By carefully regularizing the likelihood, it is possible to identify the sparsest model in the Markov equivalence class, even in the absence of an identifiable parametrization. We first study the Gaussian case in detail, showing how proper regularization of the likelihood defines a score that identifies the sparsest model. Assuming faithfulness, this score also recovers the Markov equivalence class. These results are then extended to general models and likelihoods, where the same claims hold. The theory is validated empirically, showing that the sparsest model can be recovered using standard gradient-based optimizers, thus paving the way for differentiable structure learning under general models and losses.
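To make "acyclicity-constrained" concrete, the sketch below evaluates one common differentiable acyclicity function, the NOTEARS-style $h(W) = \operatorname{tr}(e^{W \circ W}) - d$, with the power series truncated at order $d$. This is an illustration only: the abstract does not say which acyclicity function the paper uses, and the helper names `matmul` and `acyclicity` are hypothetical. Because $W \circ W$ has nonnegative entries, the truncated sum is zero exactly when the weighted graph has no directed cycle.

```python
import math


def matmul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def acyclicity(W):
    """Truncated NOTEARS-style penalty: sum_{k=1..d} tr((W∘W)^k) / k!.

    tr((W∘W)^k) is positive iff the graph of W has a closed walk of
    length k, and any cycle on d nodes contains one of length <= d,
    so the value is 0 iff W encodes a DAG. (Illustrative assumption:
    the paper may use a different acyclicity function.)
    """
    d = len(W)
    A = [[w * w for w in row] for row in W]  # Hadamard square, entries >= 0
    power = [[float(i == j) for j in range(d)] for i in range(d)]  # A^0 = I
    total = 0.0
    for k in range(1, d + 1):
        power = matmul(power, A)  # now holds A^k
        total += sum(power[i][i] for i in range(d)) / math.factorial(k)
    return total


# Fork X0 -> X1, X0 -> X2 (W[i][j] != 0 means an edge i -> j): a DAG.
fork = [[0.0, 0.8, -1.2],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]

# Adding the edge X1 -> X0 creates the cycle X0 -> X1 -> X0.
cyclic = [[0.0, 0.8, -1.2],
          [0.5, 0.0, 0.0],
          [0.0, 0.0, 0.0]]

print(acyclicity(fork))    # exactly 0.0 for a DAG
print(acyclicity(cyclic))  # strictly positive when a cycle exists
```

In a differentiable structure learning program, a function of this kind would typically enter as the constraint $h(W) = 0$ alongside the regularized likelihood; an autodiff framework would replace the pure-Python loops used here.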

Paper Structure

This paper contains 55 sections, 17 theorems, 117 equations, 14 figures, and 1 algorithm.

Key Result

Lemma 1

Let $X$ follow model eq:linear_uniden with $(B^0,\Omega^0)$ and $\Theta^0 = \Theta_f(B^0,\Omega^0)$. Assume that $P(X)$ is faithful to $G^0 := G(B^0)$. Then ${\mathcal{M}}(G^0) = {\mathcal{E}}_{\min}(\Theta^0)$.
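Assuming ${\mathcal{M}}(G^0)$ denotes the Markov equivalence class of $G^0$ (the notation is not defined in this summary), membership can be checked with the classical criterion: two DAGs are Markov equivalent if and only if they share the same skeleton and the same v-structures. The sketch below applies that criterion to the fork from Figure 4; the function names and the edge-list representation are hypothetical choices made for illustration, not the paper's code.

```python
from itertools import combinations


def skeleton(edges):
    """Undirected version of the edge set."""
    return {frozenset(e) for e in edges}


def v_structures(edges):
    """Colliders a -> c <- b whose parents a and b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    found = set()
    for child, pa in parents.items():
        for a, b in combinations(sorted(pa), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found


def markov_equivalent(edges1, edges2):
    """Same skeleton and same v-structures (both DAGs on the same nodes)."""
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1) == v_structures(edges2))


# Each pair (u, v) is a directed edge u -> v.
fork = [(0, 1), (0, 2)]      # X0 -> X1, X0 -> X2
chain = [(1, 0), (0, 2)]     # X1 -> X0 -> X2
collider = [(1, 0), (2, 0)]  # X1 -> X0 <- X2

print(markov_equivalent(fork, chain))     # True: no v-structure in either
print(markov_equivalent(fork, collider))  # False: the collider is a v-structure
```

The fork and the chain have the same number of edges and encode the same conditional independences, which is why a score can at best single out the equivalence class rather than one DAG within it.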

Figures (14)

  • Figure 1: The plot of $p_{\lambda,\delta}(t)$ with $\lambda = 2, \delta = 1$
  • Figure 2: Results in terms of SHD between the MECs of the estimated graph and the ground truth. Lower is better. Columns: $k \in \{1,2,4\}$. Rows: random graph types. {ER,SF}-$k$ = {Erdős-Rényi, Scale-Free} graphs with $kd$ expected edges. Here $p=\{10,20,50,70,100\},n=1000$.
  • Figure 3: Comparison of raw (orange) vs. standardized (green) data. SHD (lower is better) between Markov equivalence classes (MEC) of recovered and ground truth graphs for ER-2 graphs with $10$ (left) or $50$ (right) nodes. In (b), SHD for VarSort with standardized data is omitted due to its average exceeding 300.
  • Figure 4: Graph: fork structure $X_0\rightarrow X_1$ and $X_0\rightarrow X_2$. For $0<\delta<\delta_0$, the estimated $(B_{\text{est}},\Omega_{\text{est}})$ lies in ${\mathcal{E}}_{\min}(\Theta^0)$, as indicated by the SHD and distance both being close to $0$.
  • Figure 5: The plot of $p_{\lambda,\delta}(t)$ with $\lambda = 2, \delta = 1$
  • ...and 9 more figures

Theorems & Definitions (41)

  • Definition 1
  • Definition 2: Minimality
  • Lemma 1
  • Theorem 1
  • Theorem 2
  • Remark 1
  • Lemma 2
  • Theorem 3
  • Remark 2
  • Definition 3
  • ...and 31 more