A Bayesian Finite Mixture Model Approach for Mixed-type Data Clustering and Variable Selection with Censored Biomarkers

Yueting Wang, Shu Wang, Jonathan G. Yabes, Chung-Chou H. Chang

Abstract

Clustering mixed-type data to uncover clinically meaningful subgroups within heterogeneous patient populations remains a major challenge in biomedical research. Most existing clustering methods impose restrictive assumptions such as local independence, fail to accommodate censored biomarkers, or cannot quantify variable importance. We propose a Bayesian finite mixture model (BFMM) clustering framework that addresses these limitations. BFMM flexibly models both continuous and categorical variables, incorporates three covariance structures to capture cluster-specific dependencies among continuous features, and handles censored observations through likelihood-based imputation. To facilitate feature prioritization, BFMM uses spike-and-slab priors to estimate variable importance on a continuous scale from 0 to 1. Simulation studies demonstrate that BFMM outperforms existing methods in clustering accuracy, particularly under strong within-cluster correlation or with censored variables, and reliably distinguishes informative features from noise under varying conditions. We applied BFMM to two real-world datasets: (1) the SENECA cohort, which integrates electronic health records from patients with sepsis; and (2) the EDEN randomized trial of patients with acute lung injury. In both settings, BFMM identified clinically interpretable phenotypes and revealed variable-specific contributions to subgroup differentiation. In the EDEN trial, it also uncovered evidence of treatment heterogeneity. These findings validate BFMM as an effective, interpretable, and practically useful clustering tool for complex biomedical datasets.

Paper Structure

This paper contains 29 sections, 35 equations, 8 figures, 16 tables, 2 algorithms.

Figures (8)

  • Figure 1: Illustrative bivariate 3-cluster mixture density plots for EEI, EEE, and VVV structures.
  • Figure 2: Graphical representation of the proposed Bayesian FMM framework under the VVV covariance structure, BFMM[VVV]. Subscripts $i,g,m$ denote the $i$th observation ($i=1,\dots,n$), $g$th cluster ($g=1,\dots,G$), and $m$th variable ($m=1,\dots,q$ if continuous, $m=q+1,\dots,M$ if categorical), respectively.
  • Figure 3: Violin plots summarizing the adjusted Rand index (ARI) of each clustering method in each simulated scenario. Categorical variables in the simulated datasets were treated as continuous when applying the mclust and bootstrap K-means clustering methods.
  • Figure 4: BIC and ICL evaluation of BFMM on the SENECA data for $G=1, \dots, 9$ clusters. The information criteria were computed from Markov chains of 10,000 Gibbs sampling iterations each, with 5,000 used for burn-in.
  • Figure 5: Distributions of the six clinical endpoints across the clusters identified by BFMM[VVV] in the SENECA dataset. The annotated p-values are from chi-squared tests comparing proportions among the four clusters.
  • ...and 3 more figures