Hard and Soft EM in Bayesian Network Learning from Incomplete Data

Andrea Ruggieri, Francesco Stranieri, Fabio Stella, Marco Scutari

TL;DR

This paper investigates how hard EM (single imputation of missing values) and soft EM (expected sufficient statistics computed by belief propagation) compare in Bayesian network learning from incomplete data, covering both parameter learning and structure learning via Structural EM. A simulation study across multiple BN sizes and missing-data conditions shows that neither approach dominates: hard EM often yields better parameter estimates in large BNs or under certain missingness patterns, soft EM is preferable in other configurations, and a soft-forced variant generally underperforms the better of the two. A decision tree derived from the results helps practitioners choose the EM approach suited to the characteristics of their data. The paper also discusses limitations and possible extensions to Gaussian BNs and broader missing-data scenarios.

Abstract

Incomplete data are a common feature in many domains, from clinical trials to industrial applications. Bayesian networks (BNs) are often used in these domains because of their graphical and causal interpretations. BN parameter learning from incomplete data is usually implemented with the Expectation-Maximisation algorithm (EM), which computes the relevant sufficient statistics ("soft EM") using belief propagation. Similarly, the Structural Expectation-Maximisation algorithm (Structural EM) learns the network structure of the BN from those sufficient statistics using algorithms designed for complete data. However, practical implementations of parameter and structure learning often impute missing data ("hard EM") to compute sufficient statistics instead of using belief propagation, for both ease of implementation and computational speed. In this paper, we investigate the question: what is the impact of using imputation instead of belief propagation on the quality of the resulting BNs? From a simulation study using synthetic data and reference BNs, we find that it is possible to recommend one approach over the other in several scenarios based on the characteristics of the data. We then use this information to build a simple decision tree to guide practitioners in choosing the EM algorithm best suited to their problem.
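
The distinction between the two E-steps can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a hypothetical two-node network X -> Y with binary variables, where X is sometimes missing and Y is always observed, and it replaces belief propagation with a direct application of Bayes' rule, which is exact for a network this small. Soft EM adds the posterior probability of each value of X to the counts, while hard EM (here assumed to impute the single most probable value) adds a whole count to that value. The data set, starting values and pseudocount are arbitrary choices made for illustration.

```python
def posterior_x1(y, p_x, p_y_given_x):
    """P(X=1 | Y=y) by Bayes' rule in the two-node network X -> Y."""
    lik1 = p_y_given_x[1] if y == 1 else 1.0 - p_y_given_x[1]
    lik0 = p_y_given_x[0] if y == 1 else 1.0 - p_y_given_x[0]
    num = p_x * lik1
    return num / (num + (1.0 - p_x) * lik0)


def em(records, hard, n_iter=50, pseudo=1e-6):
    """Fit P(X=1) and P(Y=1 | X=x) from (x, y) pairs where x may be None."""
    p_x, p_y_given_x = 0.6, [0.3, 0.7]  # arbitrary starting point (assumption)
    for _ in range(n_iter):
        # E-step: counts[x][y], with a tiny pseudocount to avoid division by zero.
        counts = [[pseudo, pseudo], [pseudo, pseudo]]
        for x, y in records:
            if x is None:
                w1 = posterior_x1(y, p_x, p_y_given_x)
                if hard:
                    # Hard EM: impute the most probable value of X.
                    w1 = 1.0 if w1 >= 0.5 else 0.0
            else:
                w1 = float(x)
            counts[1][y] += w1        # soft EM keeps the fractional weight
            counts[0][y] += 1.0 - w1
        # M-step: maximum likelihood estimates from the (expected) counts.
        n0, n1 = sum(counts[0]), sum(counts[1])
        p_x = n1 / (n0 + n1)
        p_y_given_x = [counts[0][1] / n0, counts[1][1] / n1]
    return p_x, p_y_given_x


# Hypothetical data: 80 complete records and 20 records with X missing.
records = (
    [(1, 1)] * 30 + [(1, 0)] * 10 + [(0, 0)] * 30 + [(0, 1)] * 10
    + [(None, 1)] * 10 + [(None, 0)] * 10
)

soft_fit = em(records, hard=False)
hard_fit = em(records, hard=True)
print("soft EM:", round(soft_fit[0], 3), [round(p, 3) for p in soft_fit[1]])
print("hard EM:", round(hard_fit[0], 3), [round(p, 3) for p in hard_fit[1]])
```

On this toy data set, soft EM settles at approximately P(Y=1 | X=1) = 0.75, the value supported by the complete records, while hard EM settles at approximately 0.80 because every imputed record is counted as if X had been observed with certainty. This single example says nothing about which approach is better in general, which is the question the simulation study addresses.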

Paper Structure

This paper contains 11 sections, 14 equations, 5 figures, 3 tables, 2 algorithms.

Figures (5)

  • Figure S1: Decision tree for best practice guidance.
  • Figure S2: Leaf A. No EM algorithm proves to be more effective than the others (data sets with 5% missing data generated from the Alarm BN).
  • Figure S3: Leaf B. Hard EM achieves a value of KLD which is significantly smaller than that achieved by other EM algorithms (data sets with 5% missing data generated from the Property BN).
  • Figure S4: Leaf E. Hard EM achieves a value of KLD which is significantly smaller than that achieved by other EM algorithms (data sets with 1% missing data generated from the Formed BN).
  • Figure S5: Leaf G. Hard EM achieves a value of KLD which is significantly greater than that achieved by other EM algorithms (data sets with 1% missing data generated from the Pathfinder BN).