Analysis of Estimating the Bayes Rule for Gaussian Mixture Models with a Specified Missing-Data Mechanism
Ziyang Lyu
TL;DR
This paper investigates semi-supervised learning for Gaussian mixture models under a missing-data mechanism that ties the absence of a class label to the observed features through an entropy-based logistic model. By formulating full and incomplete likelihoods that incorporate the missingness mechanism, it shows that the Bayes rule estimated from a partially classified sample under the full missing-data model can outperform a fully supervised classifier in the two-class homoscedastic normal model, especially when the class overlap and the proportion of missing labels are moderate to low, or when the overlap is large but few labels are missing. Simulations extend the analysis to unequal covariances and three-class mixtures, consistently showing gains for the full-PC approach over the fully supervised classifier (CC) and over the classifier that ignores the missing-data mechanism (ig). Applications to interneuron and skin lesion datasets demonstrate that accounting for missingness improves classification accuracy and aligns with entropy-based expectations, suggesting practical value for MAR-informed SSL in diverse domains.
Abstract
Semi-supervised learning (SSL) approaches have been successfully applied in a wide range of engineering and scientific fields. This paper investigates the generative model framework with a missingness mechanism for unclassified observations, as introduced by Ahfock and McLachlan (2020). We show that, in a partially classified sample, a classifier using the Bayes rule of allocation with a missing-data mechanism can surpass a fully supervised classifier in a two-class normal homoscedastic model, especially when the class overlap and the proportion of missing class labels are moderate to low, or when the overlap is large but few labels are missing. It also outperforms a classifier with no missing-data mechanism regardless of the overlap region or the proportion of missing class labels. Our exploration of two- and three-component normal mixture models with unequal covariances through simulations further corroborates our findings. Finally, we illustrate the use of the proposed classifier with a missing-data mechanism on interneuronal and skin lesion datasets.
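To make the entropy-based missingness idea concrete, here is a minimal sketch for a univariate normal mixture with a common variance. It assumes the logistic model takes the form logit q(y) = xi0 + xi1 * log e(y), where e(y) is the Shannon entropy of the posterior class probabilities; that functional form, the function names, and the parameter values xi0 and xi1 are illustrative assumptions, not details taken from the text above.

```python
import math

def posterior_probs(y, pis, mus, sigma):
    """Posterior class probabilities at a scalar y for a normal mixture
    with class means `mus` and a common standard deviation `sigma`."""
    # log(pi_i * phi(y; mu_i, sigma^2)), dropping terms shared by all classes
    logs = [math.log(p) - 0.5 * ((y - m) / sigma) ** 2 for p, m in zip(pis, mus)]
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def bayes_rule(taus):
    """Allocate to the class with the largest posterior probability."""
    return max(range(len(taus)), key=lambda i: taus[i])

def shannon_entropy(taus):
    """Entropy of the posterior probabilities; 0 * log(0) is treated as 0."""
    return -sum(t * math.log(t) for t in taus if t > 0.0)

def missing_label_prob(entropy, xi0, xi1):
    """Assumed form: logit q(y) = xi0 + xi1 * log(entropy)."""
    z = xi0 + xi1 * math.log(max(entropy, 1e-300))  # floor avoids log(0)
    return 0.5 * (1.0 + math.tanh(z / 2.0))  # numerically stable logistic

# Hypothetical two-class example: equal priors, means 0 and 2, unit variance.
pis, mus, sigma = [0.5, 0.5], [0.0, 2.0], 1.0
xi0, xi1 = 0.0, 1.0  # illustrative values; xi1 > 0 links high entropy to missing labels

for y in (1.0, -3.0):  # on the decision boundary vs. far from it
    taus = posterior_probs(y, pis, mus, sigma)
    q = missing_label_prob(shannon_entropy(taus), xi0, xi1)
    print(f"y={y:+.1f}  class={bayes_rule(taus)}  P(label missing)={q:.4f}")
```

Under these assumed parameters, a point on the decision boundary has maximal entropy and so a higher probability of being unlabelled than a point deep inside one class. This is the sense in which the missing labels themselves carry information about the classification boundary.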
