Outlier-Robust Multi-Group Gaussian Mixture Modeling with Flexible Group Reassignment

Patricia Puchhammer, Ines Wilms, Peter Filzmoser

Abstract

Do expert-defined or diagnostically labeled data groups align with clusters inferred through statistical modeling? If not, where do discrepancies between predefined labels and model-based groupings occur and why? In this work, we introduce the multi-group Gaussian mixture model (MG-GMM), the first model developed to investigate these questions. It incorporates prior group information while allowing flexibility to reassign observations to alternative groups based on data-driven evidence. We achieve this by modeling the observations of each group as arising not from a single distribution, but from a Gaussian mixture comprising all group-specific distributions. Moreover, our model offers robustness against cellwise outliers that may obscure or distort the underlying group structure. We propose a novel penalized likelihood approach, called cellMG-GMM, to jointly estimate mixture probabilities, location and scale parameters of the MG-GMM, and detect outliers through a penalty term on the number of flagged cellwise outliers in the objective function. We show that our estimator has good breakdown properties in the presence of cellwise outliers. We develop a computationally efficient EM-based algorithm for cellMG-GMM, and demonstrate its strong performance in identifying and diagnosing observations at the intersection of multiple groups through simulations and diverse applications in medicine and oenology.
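
As a minimal sketch of the mixture structure described in the abstract, the density of an observation $\boldsymbol{x}$ with predefined label $g \in \{1, \dots, N\}$ could be written as follows. The symbols $\pi_{g,k}$, $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$ and $\varphi$ are illustrative assumptions and need not match the notation of the paper; only $N$ (the number of groups) is taken from the figure captions.

$$
f_g(\boldsymbol{x}) = \sum_{k=1}^{N} \pi_{g,k}\, \varphi(\boldsymbol{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad \pi_{g,k} \ge 0, \qquad \sum_{k=1}^{N} \pi_{g,k} = 1,
$$

where $\varphi(\cdot\,; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ denotes the Gaussian density of group $k$. Under this reading, a large $\pi_{g,g}$ expresses trust in the predefined label, and an observation labeled $g$ would be a candidate for reassignment to group $k \neq g$ when the posterior weight $\pi_{g,k}\, \varphi(\boldsymbol{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) / f_g(\boldsymbol{x})$ is the largest among all groups. The cellwise outlier handling and the penalty term of cellMG-GMM are not captured by this sketch.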

Paper Structure

This paper contains 32 sections, 2 theorems, 50 equations, 17 figures, 1 algorithm.

Key Result

Theorem 1

Given the idealized setting (Section subsec:bdp_mixture and extensions thereof in Section subsec:bdp_grouped) and fixed $\rho_k > 0, \boldsymbol{T}_k \succ 0$, the following breakdown results hold under the cellwise contamination paradigm:

Figures (17)

  • Figure 1: Toy example with two predefined groups (labels 1/2). Colors show model-based assignments; shaded areas are groupwise tolerance ellipses. Left: True data generating process with two mislabeled observations. Middle left: Quadratic discriminant analysis with fixed groups. Middle right: Multi-group GMM with flexible reassignment. Right: Standard GMM-based clustering.
  • Figure 2: Non-ideal setting with overlapping clusters in panel (a) versus ideal setting with well-separated clusters under the cellwise outlier paradigm in panel (b). Arrows indicate the direction of each cluster or outlier sequence.
  • Figure 3: Fictitious ideal data set with $N = 3$ groups (column blocks), $p=5$ (variables in columns per block), and respectively 8, 4, 3 clean observations and 3, 2, 5 added and possibly contaminated observations in the rows, across groups 1-3. Cell colors (red-violet-green) indicate from which group each observation originates, or outlyingness (gray).
  • Figure 4: KL-divergence for the basic balanced Scenario 1 with $N=2$ (top) and Scenario 2 with $N=5$ (bottom), for varying strength of outlyingness $\gamma_{cell}$.
  • Figure 5: Performance of cellwise outlier detection evaluated by precision, recall and F1-score for the basic balanced Scenario 1 with $N=2$ (top) and Scenario 2 with $N=5$ (bottom), for varying strength of outlyingness $\gamma_{cell}$.
  • ...and 12 more figures

Theorems & Definitions (4)

  • Theorem 1
  • Corollary B1
  • proof
  • proof