An Asymptotically Optimal Coordinate Descent Algorithm for Learning Bayesian Networks from Gaussian Models

Tong Xu; Simge Küçükyavuz; Ali Shojaie; Armeen Taeb

TL;DR

This work tackles learning Bayesian networks from Gaussian observational data under a linear SEM by optimizing an $\ell_0$-penalized Gaussian log-likelihood. It introduces a coordinate descent method in the $\Gamma$-parameterization that respects DAG constraints and uses spacer steps to stabilize updates, with theoretical guarantees of convergence to a coordinate-wise minimum and asymptotic attainment of the $\ell_0$-MLE objective. The authors prove a finite-sample optimality gap bound that vanishes as the sample size grows, and they show that, under mild assumptions, all local minima converge to near-global minima in large samples. Empirically, the method scales to large graphs, competes favorably with state-of-the-art baselines, and remains robust to moderate heteroscedasticity, with real-data experiments affirming practical applicability in causal discovery contexts.

Abstract

This paper studies the problem of learning Bayesian networks from continuous observational data generated according to a linear Gaussian structural equation model. We consider an $\ell_0$-penalized maximum likelihood estimator for this problem, which is known to have favorable statistical properties but is computationally challenging to solve, especially for medium-sized Bayesian networks. We propose a new coordinate descent algorithm to approximate this estimator and prove several remarkable properties of our procedure: the algorithm converges to a coordinate-wise minimum, and despite the non-convexity of the loss function, as the sample size tends to infinity, the objective value of the coordinate descent solution converges to the optimal objective value of the $\ell_0$-penalized maximum likelihood estimator. Finite-sample statistical consistency guarantees are also established. To the best of our knowledge, our proposal is the first coordinate descent procedure endowed with optimality and statistical guarantees in the context of learning Bayesian networks. Numerical experiments on synthetic and real data demonstrate that our coordinate descent method can obtain near-optimal solutions while being scalable.
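As a concrete picture of the data model named in the abstract, the following is a minimal sketch of sampling from a linear Gaussian structural equation model and forming the sample covariance that a likelihood-based estimator would consume. The function name `simulate_linear_gaussian_sem`, the three-node chain, the edge weights, and the convention that `B[j, k]` is the weight of edge `j -> k` are all illustrative assumptions, not taken from the paper.

```python
import numpy as np

def simulate_linear_gaussian_sem(B, noise_var, n, seed=0):
    """Draw n samples from X = B^T X + eps with eps ~ N(0, diag(noise_var)).

    Assumed convention: B[j, k] != 0 encodes a directed edge j -> k,
    and B is the weighted adjacency matrix of a DAG.
    """
    m = B.shape[0]
    rng = np.random.default_rng(seed)
    # Independent Gaussian noise, one column per node.
    eps = rng.normal(size=(n, m)) * np.sqrt(noise_var)
    # Row form of the SEM: x^T = x^T B + eps^T, hence x^T = eps^T (I - B)^{-1}.
    return eps @ np.linalg.inv(np.eye(m) - B)

# Hypothetical chain 0 -> 1 -> 2 with unit noise variances.
B = np.zeros((3, 3))
B[0, 1] = 0.8
B[1, 2] = -0.5
X = simulate_linear_gaussian_sem(B, noise_var=np.ones(3), n=2000)

# Sample covariance of the (zero-mean) data.
sigma_hat = X.T @ X / X.shape[0]
```

Because the graph is acyclic, `I - B` is invertible, which is what makes the one-line solve above valid.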

Paper Structure (19 sections, 15 theorems, 36 equations, 3 figures, 4 tables, 3 algorithms)

Introduction
Problem Setup
A Coordinate Descent Algorithm for DAG Learning
Parameter update without acyclicity constraints
Accounting for acyclicity and full algorithm description
Convergence and Optimality Guarantees
Convergence
Optimality guarantees
On the Optimization Landscape of Problem \ref{Problem:micp}
Selecting a Suitable Ordering for Coordinate Descent Updates
Synthetic and Real Experiments
Convergence of CD-l0 solution to an optimal solution
Comparison to benchmarks under near-homoscedastic error
Severe heteroscedastic error and limitations of our method
Real data from causal chambers
...and 4 more sections

Key Result

Proposition 3

The solution to problem \ref{Problem:update}, for $u,v=1,\ldots,m$ and $v \neq u$, is given by […], where $A_{uu}=\sum_{j\neq u}\Gamma_{ju}\hat{\Sigma}_{ju} + \sum_{k\neq u}\Gamma_{ku}\hat{\Sigma}_{uk}$ and $A_{uv} = \sum_{j\neq u}\Gamma_{jv}\hat{\Sigma}_{ju} + \sum_{k\neq u}\Gamma_{kv}\hat{\Sigma}_{uk}$.
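The closed-form update itself does not appear in this summary, so the sketch below is a hedged reconstruction rather than the authors' stated formula. It computes $A_{uu}$ and $A_{uv}$ exactly as defined above, and then minimizes a one-dimensional objective under the assumption that the loss is $\sum_i -2\log\Gamma_{ii} + \mathrm{tr}(\Gamma\Gamma^\top\hat{\Sigma}) + \lambda^2 \|\Gamma - \mathrm{diag}(\Gamma)\|_0$. That objective, the function names, the 0-based indexing, and the parameter `lam` are assumptions introduced for illustration. The acyclicity check that the full algorithm applies before accepting an off-diagonal update is not modeled here.

```python
import numpy as np

def coefficient_A(gamma, sigma_hat, u, v):
    """A_{uv} as defined in Proposition 3 (0-based); v == u gives A_{uu}."""
    mask = np.arange(gamma.shape[0]) != u
    return (gamma[mask, v] @ sigma_hat[mask, u]
            + gamma[mask, v] @ sigma_hat[u, mask])

def update_diagonal(gamma, sigma_hat, u):
    """Minimize -2*log(x) + s*x**2 + A_uu*x over x > 0 (assumed objective).

    Setting the derivative to zero gives 2*s*x**2 + A_uu*x - 2 = 0;
    the positive root is returned.
    """
    a = coefficient_A(gamma, sigma_hat, u, u)
    s = sigma_hat[u, u]
    return (-a + np.sqrt(a * a + 16.0 * s)) / (4.0 * s)

def update_off_diagonal(gamma, sigma_hat, u, v, lam):
    """Minimize s*x**2 + A_uv*x + lam**2 * 1[x != 0] (assumed objective).

    The unpenalized minimizer -A_uv/(2s) lowers the quadratic part by
    A_uv**2/(4s); it is kept only if that gain covers the penalty lam**2.
    """
    a = coefficient_A(gamma, sigma_hat, u, v)
    s = sigma_hat[u, u]
    return -a / (2.0 * s) if a * a / (4.0 * s) >= lam ** 2 else 0.0

# Tiny illustrative inputs (hypothetical values).
gamma = np.eye(2)
sigma_hat = np.array([[2.0, 0.5], [0.5, 1.0]])
new_diag = update_diagonal(gamma, sigma_hat, 0)
new_off_small_lam = update_off_diagonal(gamma, sigma_hat, 0, 1, lam=0.1)
new_off_large_lam = update_off_diagonal(gamma, sigma_hat, 0, 1, lam=1.0)
```

The hard-thresholding form of the off-diagonal step is what one would expect from an $\ell_0$ penalty: an entry is either set to its unpenalized minimizer or to exactly zero, with no shrinkage in between.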

Figures (3)

Figure 1: Left: scenario for Setting I; middle: scenario for Setting II.1; right: scenario for Setting II.2. Solid directed edges represent directed edges that are assumed to be in the estimate $\hat{E}$; crossed-out solid directed edges represent directed edges that are assumed to be excluded from the estimate $\hat{E}$; crossed-out solid undirected edges indicate that the corresponding nodes are not connected in $\hat{E}$; and a crossed-out dashed directed edge indicates that the edge is not present in $\hat{E}$ because adding it would create a cycle.
Figure 2: Convergence of CD-$\ell_0$ to an optimal solution
Figure 3: Learning causal models from the causal chambers data of Gamella et al. (2024)

Theorems & Definitions (21)

Definition 1
Definition 2
Proposition 3
Remark 4
Definition 5
Theorem 6
Theorem 7
Lemma 8
Lemma 9
Lemma 10
...and 11 more
