An Asymptotically Optimal Coordinate Descent Algorithm for Learning Bayesian Networks from Gaussian Models

Tong Xu, Simge Küçükyavuz, Ali Shojaie, Armeen Taeb

TL;DR

This work tackles learning Bayesian networks from Gaussian observational data under a linear SEM by optimizing an $\ell_0$-penalized Gaussian log-likelihood. It introduces a coordinate descent method in the $\Gamma$-parameterization that respects DAG constraints and uses spacer steps to stabilize updates, with theoretical guarantees of convergence to a coordinate-wise minimum and asymptotic attainment of the $\ell_0$-MLE objective. The authors prove a finite-sample optimality gap bound that vanishes as the sample size grows, and they show that, under mild assumptions, all local minima converge to near-global minima in large samples. Empirically, the method scales to large graphs, competes favorably with state-of-the-art baselines, and remains robust to moderate heteroscedasticity, with real-data experiments affirming practical applicability in causal discovery contexts.

Abstract

This paper studies the problem of learning Bayesian networks from continuous observational data, generated according to a linear Gaussian structural equation model. We consider an $\ell_0$-penalized maximum likelihood estimator for this problem which is known to have favorable statistical properties but is computationally challenging to solve, especially for medium-sized Bayesian networks. We propose a new coordinate descent algorithm to approximate this estimator and prove several remarkable properties of our procedure: the algorithm converges to a coordinate-wise minimum, and despite the non-convexity of the loss function, as the sample size tends to infinity, the objective value of the coordinate descent solution converges to the optimal objective value of the $\ell_0$-penalized maximum likelihood estimator. Finite-sample statistical consistency guarantees are also established. To the best of our knowledge, our proposal is the first coordinate descent procedure endowed with optimality and statistical guarantees in the context of learning Bayesian networks. Numerical experiments on synthetic and real data demonstrate that our coordinate descent method can obtain near-optimal solutions while being scalable.
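
To make the estimator concrete, the sketch below evaluates one common form of the $\ell_0$-penalized negative log-likelihood in the $\Gamma$-parameterization, $\sum_u -2\log\Gamma_{uu} + \mathrm{tr}(\Gamma\Gamma^\top\hat{\Sigma}) + \lambda^2\,\|\Gamma - \mathrm{diag}(\Gamma)\|_0$, together with an acyclicity check on the off-diagonal support. The exact objective is not stated in this summary, so this form, the function names `l0_objective` and `is_dag`, and the convention that $\Gamma_{uv}\neq 0$ encodes an edge $u \to v$ are assumptions made for illustration.

```python
import numpy as np


def is_dag(support: np.ndarray) -> bool:
    """Check acyclicity of the off-diagonal support with Kahn's algorithm.

    Assumed convention: support[u, v] != 0 means there is an edge u -> v.
    """
    adj = support.astype(bool).copy()
    np.fill_diagonal(adj, False)          # the diagonal carries no edges
    m = adj.shape[0]
    indeg = adj.sum(axis=0)               # number of parents of each node
    stack = [v for v in range(m) if indeg[v] == 0]
    visited = 0
    while stack:
        u = stack.pop()
        visited += 1
        for v in np.flatnonzero(adj[u]):  # children of u
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    return visited == m                   # every node removed <=> no cycle


def l0_objective(gamma: np.ndarray, sigma_hat: np.ndarray, lam: float) -> float:
    """Assumed objective: -2*sum(log diag) + tr(G G^T S) + lam^2 * #off-diagonal nonzeros."""
    diag = np.diag(gamma)
    if np.any(diag <= 0):
        return float("inf")               # the log term requires a positive diagonal
    fit = -2.0 * np.sum(np.log(diag)) + np.trace(gamma @ gamma.T @ sigma_hat)
    n_edges = np.count_nonzero(gamma - np.diag(diag))
    return float(fit + lam**2 * n_edges)


if __name__ == "__main__":
    gamma = np.eye(3)
    gamma[0, 1] = 0.5                     # one edge, 0 -> 1 under the assumed convention
    print(is_dag(gamma), l0_objective(gamma, np.eye(3), lam=0.1))
```

A candidate $\Gamma$ is feasible in this sketch only when `is_dag` returns `True`; the penalty counts edges, so sparser graphs are preferred whenever the fit term is comparable.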

Paper Structure

This paper contains 19 sections, 15 theorems, 36 equations, 3 figures, 4 tables, 3 algorithms.

Key Result

Proposition 3

The solution to problem Problem:update, for $u,v=1,\ldots,m$ and $v\not = u$ is given by where $A_{uu}=\sum\limits_{j\not=u}\Gamma_{ju}\hat{\Sigma}_{ju} + \sum\limits_{k\not=u}\Gamma_{ku}\hat{\Sigma}_{uk}$ and $A_{uv} = \sum\limits_{j\not=u}\Gamma_{jv}\hat{\Sigma}_{ju} + \sum\limits_{k\not=u}\Gamma_{kv}\hat{\Sigma}_{uk}.$
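
One way to see where $A_{uu}$ and $A_{uv}$ come from: assume the objective has the form $\sum_u -2\log\Gamma_{uu} + \mathrm{tr}(\Gamma\Gamma^\top\hat{\Sigma}) + \lambda^2\,\|\Gamma - \mathrm{diag}(\Gamma)\|_0$ with symmetric $\hat{\Sigma}$. Then, with all other entries fixed, the terms involving a single entry $x = \Gamma_{uv}$ are $\hat{\Sigma}_{uu}x^2 + A_{uv}x$ plus either $-2\log x$ (diagonal) or $\lambda^2\,\mathbb{1}[x \neq 0]$ (off-diagonal). The sketch below minimizes these one-dimensional problems. The objective form, the function names, and the tie-breaking toward zero are assumptions and may differ from the paper's exact statement; the acyclicity check that decides whether an off-diagonal entry may become nonzero is left to the caller.

```python
import numpy as np


def a_coef(gamma: np.ndarray, sigma_hat: np.ndarray, u: int, v: int) -> float:
    """A_{uv} = sum_{j != u} Gamma_{jv} Sigma_{ju} + sum_{k != u} Gamma_{kv} Sigma_{uk}."""
    mask = np.arange(gamma.shape[0]) != u
    col = gamma[mask, v]
    return float(col @ sigma_hat[mask, u] + col @ sigma_hat[u, mask])


def update_diagonal(gamma: np.ndarray, sigma_hat: np.ndarray, u: int) -> float:
    """Minimize -2*log(x) + S_uu*x^2 + A_uu*x over x > 0.

    Stationarity gives 2*S_uu*x^2 + A_uu*x - 2 = 0; we return its positive root.
    """
    a = a_coef(gamma, sigma_hat, u, u)
    s = sigma_hat[u, u]
    return float((-a + np.sqrt(a * a + 16.0 * s)) / (4.0 * s))


def update_off_diagonal(gamma: np.ndarray, sigma_hat: np.ndarray,
                        u: int, v: int, lam: float) -> float:
    """Minimize S_uu*x^2 + A_uv*x + lam^2 * 1[x != 0].

    The unpenalized minimizer -A_uv / (2*S_uu) lowers the quadratic part by
    A_uv^2 / (4*S_uu); keep it only if that gain exceeds the penalty lam^2.
    Ties are resolved toward zero here (an assumption of this sketch).
    """
    a = a_coef(gamma, sigma_hat, u, v)
    s = sigma_hat[u, u]
    gain = a * a / (4.0 * s)
    return float(-a / (2.0 * s)) if gain > lam**2 else 0.0


if __name__ == "__main__":
    sigma_hat = np.array([[1.0, 0.5], [0.5, 1.0]])
    gamma = np.eye(2)
    print(update_diagonal(gamma, sigma_hat, 0))
    print(update_off_diagonal(gamma, sigma_hat, 0, 1, lam=0.1))
```

In a full coordinate descent loop these updates would be cycled over all pairs $(u, v)$, with a nonzero off-diagonal value accepted only if the resulting support remains acyclic.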

Figures (3)

  • Figure 1: Left: scenario for Setting I; middle: scenario for Setting II.1; right: scenario for Setting II.2. Solid directed edges are assumed to be in the estimate $\hat{E}$; crossed-out solid directed edges are assumed to be excluded from $\hat{E}$; crossed-out solid undirected edges indicate that the corresponding nodes are not connected in $\hat{E}$; and a crossed-out dashed directed edge indicates an edge that is absent from $\hat{E}$ because adding it would create a cycle.
  • Figure 2: Convergence of CD-$\ell_0$ to an optimal solution
  • Figure 3: Learning causal models from causal chambers data in gamella2024causal

Theorems & Definitions (21)

  • Definition 1
  • Definition 2
  • Proposition 3
  • Remark 4
  • Definition 5
  • Theorem 6
  • Theorem 7
  • Lemma 8
  • Lemma 9
  • Lemma 10
  • ...and 11 more