Differentiable Structure Learning and Causal Discovery for General Binary Data

Chang Deng, Bryon Aragam

TL;DR

This work addresses causal-structure learning for general binary data without restricting the data-generating process. It introduces a fully differentiable framework based on the multivariate Bernoulli distribution that learns a sparse DAG, consistent up to its equivalence class, by optimizing a cross-entropy objective over higher-order interaction features while enforcing acyclicity with a differentiable constraint. The authors show that DAGs are not identifiable from observational data alone, characterize the full equivalence class of compatible graph–parameter pairs, and show that, under the sparsest Markov representation (SMR) assumption, the sparsest graph is identifiable up to Markov equivalence. They further provide a population-level guarantee that suitable regularization recovers this minimal class. To scale, they propose BiNOTEARS, a two-stage approach that leverages discrete adaptations of continuous DAG learners and achieves competitive performance on synthetic and real networks. Overall, the paper offers a principled, assumption-light pathway for learning complex causal structures in discrete data, with theoretical guarantees and practical scalability.
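To make the idea of a differentiable acyclicity constraint concrete, here is a minimal sketch. It assumes a NOTEARS-style penalty, a degree-$d$ truncation of $\operatorname{tr}(e^{W\circ W}) - d$; this summary does not state which constraint the paper actually uses, so the function `acyclicity` and the example matrices are purely illustrative.

```python
from math import factorial


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def acyclicity(w):
    """Assumed NOTEARS-style penalty: sum_{k=1..d} tr((W∘W)^k) / k!.

    W∘W is entrywise nonnegative, so tr((W∘W)^k) > 0 exactly when the
    weighted graph has a closed walk of length k. The sum is therefore
    zero if and only if the support of W is a DAG.
    """
    d = len(w)
    a = [[x * x for x in row] for row in w]  # elementwise square
    power = [[float(i == j) for j in range(d)] for i in range(d)]  # identity
    h = 0.0
    for k in range(1, d + 1):
        power = matmul(power, a)  # now holds (W∘W)^k
        h += sum(power[i][i] for i in range(d)) / factorial(k)
    return h


# Hypothetical weight matrices: a chain 0 -> 1 -> 2 and a 3-cycle.
dag = [[0.0, 1.5, 0.0],
       [0.0, 0.0, -2.0],
       [0.0, 0.0, 0.0]]
cyc = [[0.0, 1.0, 0.0],
       [0.0, 0.0, 1.0],
       [1.0, 0.0, 0.0]]

print(acyclicity(dag))  # zero for the chain
print(acyclicity(cyc))  # positive for the cycle
```

Because the penalty is a polynomial in the entries of $W$, it can be added to a smooth loss and driven to zero by gradient-based solvers; an autodiff library would supply the gradients in practice.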

Abstract

Existing methods for differentiable structure learning in discrete data typically assume that the data are generated from specific structural equation models. However, these assumptions may not align with the true data-generating process, which limits the general applicability of such methods. Furthermore, current approaches often ignore the complex dependence structure inherent in discrete data and consider only linear effects. We propose a differentiable structure learning framework that is capable of capturing arbitrary dependencies among discrete variables. We show that although general discrete models are unidentifiable from purely observational data, it is possible to characterize the complete set of compatible parameters and structures. Additionally, we establish identifiability up to Markov equivalence under mild assumptions. We formulate the learning problem as a single differentiable optimization task in the most general form, thereby avoiding the unrealistic simplifications adopted by previous methods. Empirical results demonstrate that our approach effectively captures complex relationships in discrete data.

Paper Structure

This paper contains 56 sections, 10 theorems, 107 equations, 6 figures, 1 table, 4 algorithms.

Key Result

Corollary 1

If $X\sim\mathrm{MultiBernoulli}(\boldsymbol{p})$, then every marginal distribution and every conditional distribution of $X$ is again multivariate Bernoulli.
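The corollary can be checked numerically on a small case. The sketch below assumes a joint distribution over $\{0,1\}^3$ stored as a dictionary from outcomes to probabilities; the helper names `marginal` and `conditional` and the example weights are illustrative and do not come from the paper.

```python
from itertools import product


def marginal(pmf, keep):
    """Sum out every coordinate whose index is not listed in `keep`."""
    out = {}
    for x, pr in pmf.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + pr
    return out


def conditional(pmf, given):
    """Condition on `given` (index -> value); return the pmf of the rest."""
    d = len(next(iter(pmf)))
    free = [i for i in range(d) if i not in given]
    out = {}
    for x, pr in pmf.items():
        if all(x[i] == v for i, v in given.items()):
            key = tuple(x[i] for i in free)
            out[key] = out.get(key, 0.0) + pr
    z = sum(out.values())
    if z == 0.0:
        raise ValueError("conditioning event has probability zero")
    return {k: v / z for k, v in out.items()}


# Hypothetical joint pmf on {0,1}^3 with unnormalized weights 1..8.
outcomes = list(product([0, 1], repeat=3))
pmf = {x: (i + 1) / 36 for i, x in enumerate(outcomes)}

m = marginal(pmf, keep=[0, 2])      # distribution of (X0, X2)
c = conditional(pmf, given={1: 1})  # distribution of (X0, X2) given X1 = 1

print(m)
print(c)
```

Both results are nonnegative tables over $\{0,1\}^2$ that sum to one, that is, probability vectors of length $2^2$, which is the form a multivariate Bernoulli distribution on two variables takes.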

Figures (6)

  • Figure 1: SHD between the MECs of the estimated graph and the ground truth. Lower is better. Column: $k = \{1,2\}$. Row: random graph types. $\{\text{ER},\text{SF}\}$-$k = \{\text{Erdős–Rényi}, \text{Scale-Free}\}$ graphs with $k\cdot d$ expected edges. Here $p = \{5,6,7,8,9\}$. Error bars denote the standard error computed over 10 replications.
  • Figure 2: SHD between the MECs of the estimated graph and the ground truth. Lower is better. Column: $k = \{1,2,4\}$. Row: random graph types. $\{\text{ER},\text{SF}\}$-$k = \{\text{Erdős–Rényi}, \text{Scale-Free}\}$ graphs with $kd$ expected edges. Here $p = \{10,20,30,40\}$. BiNOTEARS is our two-stage approach.
  • Figure 3: SHD between the MECs of the estimated graph and the ground truth. Lower is better. Column: $k = \{1,2\}$. Row: random graph types. $\{\text{ER},\text{SF}\}$-$k = \{\text{Erdős–Rényi}, \text{Scale-Free}\}$ graphs with $kd$ expected edges. Here $p = \{5,6,7,8,9\}$. Error bars denote the standard error computed over 10 replications.
  • Figure 4: SHD between the MECs of the estimated graph and the ground truth. Lower is better. Data are generated using the extended feature map $\Phi^{\text{1st+pth}}$ in \ref{eq:extendfeature1}. Column: $k = \{1,2,4\}$. Row: random graph types. $\{\text{ER},\text{SF}\}$-$k = \{\text{Erdős–Rényi}, \text{Scale-Free}\}$ graphs with $kd$ expected edges. Here $p = \{10,20,30,40\}$. BiNOTEARS is our two-stage approach.
  • Figure 5: SHD between the MECs of the estimated graph and the ground truth. Lower is better. Data are generated using the extended feature map $\Phi^{\text{2nd}}$ in \ref{eq:extendfeature1}. Column: $k = \{1,2,4\}$. Row: random graph types. $\{\text{ER},\text{SF}\}$-$k = \{\text{Erdős–Rényi}, \text{Scale-Free}\}$ graphs with $kd$ expected edges. Here $p = \{10,20,30,40\}$. BiNOTEARS is our two-stage approach.
  • ...and 1 more figure
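The captions above report the structural Hamming distance (SHD) between Markov equivalence classes. As a rough illustration, the sketch below compares two CPDAGs that are already given as adjacency matrices and counts each node pair whose edge status differs as one error. This counting convention and the function name `shd_cpdag` are assumptions; the captions do not specify the exact convention the paper uses, and the step of converting a DAG to its CPDAG is omitted.

```python
def shd_cpdag(a, b):
    """Count unordered node pairs whose edge status differs between CPDAGs.

    Encoding (an assumption for this sketch): a[i][j] == 1 and a[j][i] == 0
    means i -> j; a[i][j] == a[j][i] == 1 means an undirected edge i - j.
    A missing, extra, or differently oriented edge each counts as one error.
    """
    d = len(a)
    errors = 0
    for i in range(d):
        for j in range(i + 1, d):
            if (a[i][j], a[j][i]) != (b[i][j], b[j][i]):
                errors += 1
    return errors


# The chain X0 -> X1 -> X2 has no v-structure, so its CPDAG is X0 - X1 - X2.
chain_cpdag = [[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]]
# The collider X0 -> X1 <- X2 keeps both orientations in its CPDAG.
collider_cpdag = [[0, 1, 0],
                  [0, 0, 0],
                  [0, 1, 0]]

print(shd_cpdag(chain_cpdag, collider_cpdag))
```

Comparing CPDAGs rather than DAGs avoids penalizing an estimate for orienting an edge differently from the ground truth when both orientations lie in the same equivalence class.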

Theorems & Definitions (18)

  • Definition 1: Multivariate Bernoulli distribution [dai2013multivariate]
  • Corollary 1
  • Theorem 1
  • Remark 1
  • Remark 2
  • Definition 2: Minimality [deng2024markov]
  • Theorem 2
  • Theorem 3
  • Lemma 1
  • Lemma 2
  • ...and 8 more