Parameter Estimation in DAGs from Incomplete Data via Optimal Transport

Vy Vo; Trung Le; Tung-Long Vuong; He Zhao; Edwin Bonilla; Dinh Phung

TL;DR

This work introduces OTP-DAG, an optimal-transport-based framework for parameter learning in directed acyclic graphical models with latent variables. By recasting learning as minimizing a Wasserstein distance between data and model distributions and employing backward maps from observed nodes to their parents, the approach yields a tractable, end-to-end objective that extends Wasserstein auto-encoders to general DAGs. The authors provide theoretical justification (via a key OT theorem) and extensive empirical evidence across LDA, HMMs, and discrete representation learning, showing robust parameter recovery and competitive downstream performance versus EM and VI baselines. OTP-DAG offers a scalable, flexible alternative to likelihood-based methods, with potential for broader applicability to complex graphical models and future structure-learning tasks.

Abstract

Estimating the parameters of a probabilistic directed graphical model from incomplete data is a long-standing challenge. This is because, in the presence of latent variables, both the likelihood function and posterior distribution are intractable without assumptions about structural dependencies or model classes. While existing learning methods are fundamentally based on likelihood maximization, here we offer a new view of the parameter learning problem through the lens of optimal transport. This perspective licenses a general framework that operates on any directed graph without making unrealistic assumptions on the posterior over the latent variables or resorting to variational approximations. We develop a theoretical framework and support it with extensive empirical evidence demonstrating the versatility and robustness of our approach. Across experiments, we show that our method not only recovers the ground-truth parameters effectively but also performs comparably to or better than competing baselines on downstream applications.
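The idea of fitting parameters by minimizing a Wasserstein distance between the data distribution and the model distribution can be illustrated on a toy problem. The sketch below is not OTP-DAG and involves no latent variables or backward maps; it assumes a one-dimensional Gaussian model with unknown mean and unit variance, and the function names `wasserstein_1d` and `fit_mean_by_ot` are hypothetical. It relies only on the standard fact that, for two equal-size one-dimensional samples, the empirical 1-Wasserstein distance is the mean absolute difference between the sorted samples.

```python
import numpy as np


def wasserstein_1d(x, y):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("samples must have the same size")
    # In 1-D the optimal coupling matches the i-th smallest points of each sample.
    return float(np.mean(np.abs(x - y)))


def fit_mean_by_ot(data, candidates, seed=0):
    """Pick the candidate mean whose model samples are closest to the data in W1.

    Assumption for illustration: the model is N(mu, 1), sampled as mu + noise.
    """
    rng = np.random.default_rng(seed)
    # Draw the noise once and reuse it for every candidate, so that the
    # distances differ only through mu and not through sampling variation.
    noise = rng.standard_normal(len(data))
    distances = [wasserstein_1d(data, mu + noise) for mu in candidates]
    best = int(np.argmin(distances))
    return float(candidates[best]), distances


# Synthetic data with true mean 2.0 (an arbitrary choice for this sketch).
rng = np.random.default_rng(42)
data = 2.0 + rng.standard_normal(2000)

candidates = np.linspace(-5.0, 5.0, 101)
mu_hat, distances = fit_mean_by_ot(data, candidates)
print("estimated mean:", mu_hat)
```

A grid search is used here only to keep the example dependency-free; in a differentiable setting the same distance would typically be minimized by gradient descent over the model parameters.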

Paper Structure (42 sections, 2 theorems, 32 equations, 11 figures, 6 tables, 1 algorithm)

Introduction
OT as an alternative to MLE.
Contributions.
Related work
Variational Inference.
Optimal Transport.
Preliminaries
Directed Graphical Models.
Optimal transport.
Optimal Transport for Learning Directed Graphical Models
Proof.
Remark.
Applications
Baselines.
Experimental setup.
...and 27 more sections

Key Result

Theorem 4.1

For every $\phi_i$ as defined above and fixed $\psi_{\theta}$, where $\mathrm{PA}_{X_{\mathbf{O}}} := [[X_{ij}]_{j \in \mathrm{PA}_{X_i}}]_{i \in {\mathbf{O}}}$.

Figures (11)

Figure 1: Visualization of mean absolute errors of the inferred means $\hat{\mu}$ and the true values $\mu$ for $300$ steps, averaged over $100$ simulations. $\mu_{ki}$ indicates the mean of the component $k$ at dimension $i$. The red line represents our method OTP-DAG. The blue line represents EM. Three mis-specified cases are studied: Case (1) mis-specified variances, Case (2) mis-specified weights and Case (3) mis-specified both variances and weights.
Figure 2: (Left) A DAG represents a system of $4$ endogenous variables where $X_1, X_3$ are observed (black-shaded) and $X_2, X_4$ are hidden variables (non-shaded). (Middle) The extended DAG includes an additional set of independent exogenous variables $U_1, U_2, U_3, U_4$ (grey-shaded) acting on each endogenous variable. $U_1, U_2, U_3, U_4 \sim P(U)$ where $P(U)$ is a prior product distribution. (Right) Visualization of our backward-forward algorithm, where the dashed arcs represent the backward maps involved in optimization.
Figure 3: Empirical structures of (left) latent Dirichlet allocation model (in plate notation), (middle) standard hidden Markov model, and (right) discrete representation learning.
Figure 4: Topic-word distributions inferred by each method from the 1st set of synthetic data after 300 training epochs.
Figure 5: (Left) Algorithmic DAG. (Right) Standard Auto-encoder.
...and 6 more figures

Theorems & Definitions (4)

Theorem 4.1
proof
Theorem 1.1
proof
