Copula-based mixture model identification for subgroup clustering with imaging applications

Fei Zheng, Nicolas Duchateau

TL;DR

This work introduces Copula-Based Mixture Models (CBMMs) to cluster data with heterogeneous marginal distributions and dependencies, extending beyond Gaussian mixtures. It develops a Generalized Iterative Conditional Estimation (GICE) framework that simultaneously identifies marginal and copula forms from candidate sets and estimates their parameters, using simulated labels to guide updates. The method is evaluated on synthetic data, large-scale MNIST clustering, and myocardial infarct pattern analysis, demonstrating improved density fit and more flexible, non-elliptical cluster structures compared to EM-based GMMs. The results indicate CBMM-GICE is particularly advantageous for imaging applications with non-Gaussian patterns and heterogeneous dependencies, providing a principled approach to risk stratification and subgroup discovery in medical data.
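
To make the idea of a component built from separate marginal and copula choices concrete, here is a minimal two-dimensional sketch. It relies on the standard copula factorization (Sklar's theorem): a component density is the copula density evaluated at the marginal CDFs, multiplied by the marginal densities. The Gaussian copula, the normal and exponential marginals, and all function names below are illustrative assumptions, not the candidate sets or code used by the authors.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the copula transform


def gaussian_copula_density(u, v, rho):
    """Bivariate Gaussian copula density c(u, v; rho) for u, v in (0, 1)."""
    a, b = STD_NORMAL.inv_cdf(u), STD_NORMAL.inv_cdf(v)
    det = 1.0 - rho * rho
    expo = -(rho * rho * (a * a + b * b) - 2.0 * rho * a * b) / (2.0 * det)
    return math.exp(expo) / math.sqrt(det)


def normal_marginal(mu, sigma):
    """Return (pdf, cdf) of a normal marginal (illustrative candidate)."""
    dist = NormalDist(mu, sigma)
    return dist.pdf, dist.cdf


def exponential_marginal(rate):
    """Return (pdf, cdf) of an exponential marginal (illustrative candidate)."""
    def pdf(x):
        return rate * math.exp(-rate * x) if x >= 0 else 0.0

    def cdf(x):
        return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0

    return pdf, cdf


def component_density(x, marginals, rho):
    """Copula density at the marginal CDFs times the marginal densities."""
    (f1, F1), (f2, F2) = marginals
    u, v = F1(x[0]), F2(x[1])
    if not (0.0 < u < 1.0 and 0.0 < v < 1.0):
        return 0.0  # outside the support (or numerically at its boundary)
    return gaussian_copula_density(u, v, rho) * f1(x[0]) * f2(x[1])


def mixture_density(x, weights, components):
    """Weighted sum of component densities; components are (marginals, rho)."""
    return sum(
        w * component_density(x, marginals, rho)
        for w, (marginals, rho) in zip(weights, components)
    )


# Two components with different marginal forms and dependence strengths.
components = [
    ((exponential_marginal(2.0), normal_marginal(0.0, 1.0)), 0.3),
    ((normal_marginal(4.0, 1.0), normal_marginal(3.0, 0.5)), -0.6),
]
density_at_point = mixture_density((1.0, 0.3), [0.4, 0.6], components)
```

With normal marginals and a Gaussian copula, a component reduces to an ordinary bivariate Gaussian, which is why a GMM can be seen as one special case of this construction; swapping a marginal or the copula yields non-elliptical cluster shapes.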

Abstract

Model-based clustering techniques have been widely applied across application areas, but most studies focus on canonical mixtures in which all components share a single distribution form. This strict assumption is often hard to satisfy. In this paper, we consider the more flexible Copula-Based Mixture Models (CBMMs) for clustering, which allow heterogeneous component distributions composed of flexible choices of marginal and copula forms. More specifically, we propose an adaptation of the Generalized Iterative Conditional Estimation (GICE) algorithm to identify CBMMs in an unsupervised manner, where the marginal and copula forms and their parameters are estimated iteratively. GICE is adapted from its original version, developed for switching Markov model identification, with the choice of realization time. Our CBMM-GICE clustering method is then tested on synthetic two-cluster data (N=2000 samples), with a discussion of the factors impacting its convergence. Finally, it is compared to mixture models with a unique component form identified by Expectation-Maximization (EM), on the entire MNIST database (N=70000) and on real cardiac magnetic resonance data (N=276), to illustrate its value for imaging applications.
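
The iterative identification described above can be illustrated with a deliberately simplified, one-dimensional sketch of a single form-selection step: labels are simulated from the current posteriors several times (the "realization time"), and for each cluster a marginal form is picked from a candidate set. The selection criterion (maximum log-likelihood), the majority vote across realizations, the two candidate forms, and every function name here are assumptions made for illustration; the paper's actual criteria and its copula selection step are not reproduced.

```python
import math
import random


def fit_normal(xs):
    """Maximum-likelihood normal fit; returns (parameters, log-likelihood)."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    sigma = max(math.sqrt(var), 1e-9)
    loglik = sum(
        -0.5 * math.log(2.0 * math.pi * sigma**2) - (x - mu) ** 2 / (2.0 * sigma**2)
        for x in xs
    )
    return {"mu": mu, "sigma": sigma}, loglik


def fit_exponential(xs):
    """Maximum-likelihood exponential fit; impossible if any value is negative."""
    if min(xs) < 0:
        return None, -math.inf
    rate = 1.0 / max(sum(xs) / len(xs), 1e-12)
    loglik = len(xs) * math.log(rate) - rate * sum(xs)
    return {"rate": rate}, loglik


# Hypothetical candidate set of marginal forms.
CANDIDATES = {"normal": fit_normal, "exponential": fit_exponential}


def simulate_labels(posteriors, rng):
    """Draw one label per sample from its posterior cluster probabilities."""
    return [rng.choices(range(len(p)), weights=p)[0] for p in posteriors]


def select_marginal_forms(xs, posteriors, n_clusters, n_realizations, rng):
    """Vote, over simulated label realizations, for each cluster's best form."""
    votes = [{name: 0 for name in CANDIDATES} for _ in range(n_clusters)]
    for _ in range(n_realizations):
        labels = simulate_labels(posteriors, rng)
        for k in range(n_clusters):
            members = [x for x, z in zip(xs, labels) if z == k]
            if len(members) < 2:
                continue  # too few samples to fit anything
            best = max(CANDIDATES, key=lambda name: CANDIDATES[name](members)[1])
            votes[k][best] += 1
    return [max(v, key=v.get) for v in votes]


# Toy data: an exponential cluster and a well-separated normal cluster.
rng = random.Random(0)
xs = [rng.expovariate(1.0) for _ in range(300)] + [rng.gauss(10.0, 1.0) for _ in range(300)]
posteriors = [[0.98, 0.02]] * 300 + [[0.02, 0.98]] * 300
forms = select_marginal_forms(xs, posteriors, n_clusters=2, n_realizations=20, rng=rng)
```

In a full procedure, a step like this would alternate with re-estimating the parameters and posteriors until convergence, and an analogous selection would be run over candidate copulas.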

Paper Structure

This paper contains 21 sections, 11 equations, 11 figures, 6 tables, 3 algorithms.

Figures (11)

  • Figure 1: Synthetic experiment (N=2000 samples) to evaluate the performance of GICE on non-Gaussian CBMM identification.
  • Figure 2: Synthetic experiment (N=2000 samples) to evaluate the performance of GICE on GMM identification.
  • Figure 3: Experiment on the MNIST dataset (N=70000 samples, 10 clusters, 2D projection obtained by UMAP with KNN=30), to evaluate the performance of GICE on more than two clusters. Each point corresponds to the image of a digit, colored by its corresponding ground truth label.
  • Figure 4: Clustering of the projected infarct segments in LAD and RCA territories (N=276 samples). The illustrated projection was selected using the trustworthiness score [venna2005local] (the highest values corresponding to projections that best preserve local neighborhoods) among 50 UMAP projections with KNN = 6.
  • Figure 5: Infarct territory clustering result of GMM-EM and CBMM-GICE (realization time $T$=50, GMM initialization, 100 max. iterations, N=276 samples). The star marks the best result for each metric.
  • ...and 6 more figures