Analysis of Estimating the Bayes Rule for Gaussian Mixture Models with a Specified Missing-Data Mechanism

Ziyang Lyu

TL;DR

This paper investigates semi-supervised learning for Gaussian mixture models under a missing-data mechanism that ties label absence to the observed features via an entropy-based logistic model. By formulating full and incomplete likelihoods that incorporate the missingness mechanism, it shows that the Bayes rule estimated from the full missing-data model on a partially classified (PC) sample can outperform a fully supervised classifier trained on a completely classified (CC) sample, especially when class overlap and the proportion of missing labels are low to moderate. Simulations extend the analysis to unequal covariances and three-class mixtures, consistently showing gains for the full-PC approach over CC and over the PC baseline that ignores the missingness mechanism (ig). Real-data applications on interneuron and skin lesion datasets show that accounting for missingness improves classification accuracy and agrees with entropy-based expectations, suggesting practical value for MAR-informed SSL in diverse domains.

Abstract

Semi-supervised learning (SSL) approaches have been successfully applied in a wide range of engineering and scientific fields. This paper investigates the generative model framework with a missingness mechanism for unclassified observations, as introduced by Ahfock and McLachlan (2020). We show that, in a partially classified sample, a classifier using the Bayes rule of allocation with a missing-data mechanism can surpass a fully supervised classifier in a two-class normal homoscedastic model, especially when the overlap and the proportion of missing class labels are both low to moderate, or when the overlap is large but few labels are missing. It also outperforms a classifier with no missing-data mechanism regardless of the overlap region or the proportion of missing class labels. Our simulations with two- and three-component normal mixture models with unequal covariances further corroborate these findings. Finally, we illustrate the use of the proposed classifier with a missing-data mechanism on interneuronal and skin lesion datasets.

Paper Structure (12 sections, 2 theorems, 38 equations, 6 figures, 6 tables)

Introduction
Notation
Methodology
Missing data mechanism
Two-class normal homoscedastic model
Simulation study
Two-class normal model with unequal covariance matrices
Three-class normal model with unequal covariance matrices
Application
Interneuron dataset
Skin lesion dataset
Discussion

Key Result

Lemma

Given a two-class normal homoscedastic model in canonical form with equal prior probabilities $\pi_1=\pi_2$, the entropy of an observation $\mathbf{y}$ increases as the squared Mahalanobis distance between the classes decreases (equivalently, as the overlap region becomes larger).
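
The lemma can be checked numerically in the scalar case. The following is a minimal sketch, not the paper's code: it assumes the canonical form places the class means at $+\Delta/2$ and $-\Delta/2$ with unit variance, so that the posterior probability of class 1 is $1/(1+e^{-\Delta y})$, and it takes "entropy" to mean the Shannon entropy of the two posterior probabilities. The function name `posterior_entropy` is hypothetical.

```python
import numpy as np

def posterior_entropy(y, delta):
    """Shannon entropy of the two class posteriors at a scalar observation y.

    Assumed setup (not taken verbatim from the paper): class means are
    +delta/2 and -delta/2, common unit variance, equal priors.
    Under that setup the class-1 posterior is 1 / (1 + exp(-delta * y)).
    """
    tau1 = 1.0 / (1.0 + np.exp(-delta * y))
    # Clip to avoid log(0) when a posterior underflows to exactly 0.
    tau = np.clip(np.array([tau1, 1.0 - tau1]), 1e-300, 1.0)
    return float(-(tau * np.log(tau)).sum())

# Entropy at a fixed observation y = 1 for the three separations used in Figure 3.
deltas = [0.25, 2.5, 10.0]
entropies = [posterior_entropy(1.0, d) for d in deltas]
```

Under these assumptions the entropy at a fixed nonzero $y$ shrinks as $\Delta$ grows, and it attains its maximum $\log 2$ on the decision boundary $y=0$ for any $\Delta$.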

Figures (6)

Figure 1: Plot of the asymptotic relative efficiency $\operatorname{ARE}(\boldsymbol{\beta}_{\mathrm{PC}}^{(\mathrm{full})}, \boldsymbol{\beta}_{\mathrm{CC}})$ versus the square root of the squared Mahalanobis distance between the two classes, $\Delta$, for $\pi_1=\pi_2$.
Figure 2: Plot of the asymptotic relative efficiency $\operatorname{ARE}(\boldsymbol{\beta}_{\mathrm{PC}}^{(\mathrm{full})}, \boldsymbol{\beta}_{\mathrm{PC}}^{(\mathrm{ig})})$ versus the square root of the squared Mahalanobis distance between the two classes, $\Delta$, for $\pi_1=\pi_2$.
Figure 3: Simulated scalar observations for $n=5000$, $\Delta \in \{0.25, 2.5, 10\}$, and $\pi_1=\pi_2$ (black: class 1; red: class 2; blue: unclassified observations generated with $\boldsymbol{\xi}=(1,0.5)^T$). Panels (a) to (c) show the large, moderate, and small overlap regions, respectively; panels (d) to (f) show the same overlap regions with the unclassified observations (blue) included.
Figure 4: Plots of the degree of separation of two populations versus the average proportion of missing class labels over 1000 samples with sample size $n=200$ and $\pi_1=\pi_2$.
Figure 5: Analysis of the interneuron dataset with regard to the relationship between the entropy and the labeled and unlabeled observations.
...and 1 more figure
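
The simulation settings named in the captions of Figures 3 and 4 can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: it assumes class means at $\pm\Delta/2$ with unit variance, and it assumes the entropy-based logistic mechanism has the form $q(y)=\operatorname{logistic}(\xi_0+\xi_1\log e(y))$, where $e(y)$ is the Shannon entropy of the posterior class probabilities. The function name `simulate_partially_classified` and its parameters are hypothetical.

```python
import numpy as np

def simulate_partially_classified(n, delta, xi, seed=0):
    """Draw scalar two-class data and mask labels with an entropy-based rule.

    Assumptions (labelled as such, not confirmed by the source text):
      - class means +delta/2 and -delta/2, unit variance, equal priors;
      - P(label missing | y) = logistic(xi[0] + xi[1] * log entropy(y)).
    Returns observations y, true labels z (0 or 1), and a boolean mask
    that is True where the label is treated as missing.
    """
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=n)                  # 0 = class 1, 1 = class 2
    means = np.where(z == 0, delta / 2.0, -delta / 2.0)
    y = rng.normal(loc=means, scale=1.0)

    # Posterior probabilities of the two classes at each observation.
    tau1 = 1.0 / (1.0 + np.exp(-delta * y))
    tau = np.clip(np.stack([tau1, 1.0 - tau1]), 1e-300, 1.0)
    entropy = -(tau * np.log(tau)).sum(axis=0)

    # Assumed logistic missingness model on the log-entropy.
    eta = xi[0] + xi[1] * np.log(entropy)
    q = 1.0 / (1.0 + np.exp(-eta))
    missing = rng.random(n) < q
    return y, z, missing

xi = (1.0, 0.5)
_, _, miss_large_overlap = simulate_partially_classified(5000, 0.25, xi)
_, _, miss_small_overlap = simulate_partially_classified(5000, 10.0, xi)
```

With a positive $\xi_1$, this assumed mechanism removes more labels where the posterior entropy is high, so the large-overlap sample loses a larger share of its labels than the small-overlap sample.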

Theorems & Definitions (4)

lemma
proof
theorem 1
proof
