Interpretable Clustering with the Distinguishability Criterion

Ali Turfah, Xiaoquan Wen

TL;DR

The paper introduces the Distinguishability criterion $P_{\rm mc}$, a Bayes-risk-based measure of cluster separability that can be used to validate and select cluster configurations across diverse clustering methods. It develops a combined loss framework that integrates $P_{\rm mc}$ with existing criteria, and presents the PHM algorithm, built on finite mixture models, which merges components by prioritizing those that overlap most and produces interpretable dendrograms. The authors connect $P_{\rm mc}$ to internal validity indices, show how it supports hypothesis testing with hierarchical clustering, and validate the approach on synthetic data and real-world examples (penguin morphology, HGDP genetics, and single-cell RNA-seq). They show that $P_{\rm mc}$ can improve cluster-count decisions when clusters overlap, that it can be computed scalably via Monte Carlo estimation, and that the resulting clustering structure aligns with underlying biology or population structure. The authors suggest that $P_{\rm mc}$ applies broadly to model-based and heuristic clustering, with potential extensions to other latent-variable models and coalescent-type analyses.

Abstract

Cluster analysis is a popular unsupervised learning tool used in many disciplines to identify heterogeneous sub-populations within a sample. However, validating cluster analysis results and determining the number of clusters in a data set remain outstanding problems. In this work, we present a global criterion called the Distinguishability criterion to quantify the separability of identified clusters and validate inferred cluster configurations. Our computational implementation of the Distinguishability criterion corresponds to the Bayes risk of a randomized classifier under the 0-1 loss. We propose a combined loss function-based computational framework that integrates the Distinguishability criterion with many commonly used clustering procedures, such as hierarchical clustering, k-means, and finite mixture models. We present these new algorithms as well as the results from comprehensive data analysis based on simulation studies and real data applications.
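To make the "Bayes risk of a randomized classifier under the 0-1 loss" concrete, here is a minimal Monte Carlo sketch for a univariate Gaussian mixture. It assumes that the randomized classifier assigns a point $x$ to cluster $k$ with probability equal to the posterior membership probability of $k$ given $x$, so that the misclassification probability at $x$ is one minus the sum of squared posteriors. That reading, and the function names `posteriors` and `estimate_pmc`, are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posteriors(x, weights, mus, sigmas):
    """Posterior cluster membership probabilities at x (Bayes' rule)."""
    dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
    total = sum(dens)
    return [d / total for d in dens]


def estimate_pmc(weights, mus, sigmas, n_draws=20000, seed=0):
    """Monte Carlo estimate of the randomized-rule misclassification risk.

    Assumption: labels are sampled from the posterior, so the risk at x
    is 1 - sum_k tau_k(x)^2, averaged over draws x from the mixture.
    """
    rng = random.Random(seed)
    k = len(weights)
    acc = 0.0
    for _ in range(n_draws):
        # Draw a component, then a point from that component.
        comp = rng.choices(range(k), weights=weights)[0]
        x = rng.gauss(mus[comp], sigmas[comp])
        tau = posteriors(x, weights, mus, sigmas)
        acc += 1.0 - sum(t * t for t in tau)
    return acc / n_draws


if __name__ == "__main__":
    for sep in (0.0, 1.0, 2.0, 4.0):
        est = estimate_pmc([0.5, 0.5], [0.0, sep], [1.0, 1.0])
        print(f"separation={sep:.1f}  estimated risk={est:.4f}")
```

Under this assumption, two identical, equally weighted components give a risk of exactly 0.5, and the estimate shrinks toward 0 as the components move apart.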

Paper Structure

This paper contains 27 sections, 1 theorem, 37 equations, 11 figures, 4 tables, 1 algorithm.

Key Result

Proposition 1

Merging two existing clusters indexed by $i$ and $j$ leads to Furthermore,
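The displayed equations of the proposition are not reproduced here. As a hedged illustration only, the following sketch shows why merging two clusters cannot increase the misclassification probability of a randomized rule. It assumes that the rule assigns $x$ to cluster $k$ with posterior probability $\tau_k(x)$; the symbol $\tau_k$ and this form of $P_{\rm mc}$ are assumptions made for the sketch, not the paper's stated result.

$$
P_{\rm mc} \;=\; \mathbb{E}_x\!\left[\,1 - \sum_k \tau_k(x)^2\,\right] \;=\; \sum_{i<j} 2\,\mathbb{E}_x\!\left[\tau_i(x)\,\tau_j(x)\right].
$$

After merging clusters $i$ and $j$, the merged cluster has posterior $\tau_i(x) + \tau_j(x)$, and

$$
\bigl(\tau_i(x) + \tau_j(x)\bigr)^2 \;=\; \tau_i(x)^2 + \tau_j(x)^2 + 2\,\tau_i(x)\,\tau_j(x),
$$

so under this assumption the merge lowers the risk by $2\,\mathbb{E}_x[\tau_i(x)\,\tau_j(x)] \ge 0$. On this reading, that pairwise term would play the role of the overlap contribution written $\Delta P_{\rm mc}^{(i,j)}$ in the Figure 1 caption.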

Figures (11)

  • Figure 1: (Left) 600 simulated observations drawn from a mixture of six two-dimensional Gaussian distributions. Colors indicate the cluster assignment labels to each of the six estimated mixture components. (Right) Heatmap visualizing $\Delta P_{\rm mc}$ values for the estimated mixture components. The intensity of the color indicates the relative proportion of $P_{\rm mc}$ contributed by the overlap between these components (i.e., $\Delta P_{\rm mc}^{(i, j)}$).
  • Figure 1: Values of $P_{\rm mc}$ based on the randomized and optimal decision rules $\delta_r$ and $\delta_o$. The value $P_{\rm mc}$ is shown in the y-axis and is calculated for two univariate Gaussian distributions $N(\mu_1, \sigma)$ and $N(\mu_2, \sigma)$ where $\pi_1 = \pi_2 = 0.5$. The x-axis indicates the degree of cluster separation in terms of the distribution parameters.
  • Figure 2: (Left) 450 simulated observations drawn from a mixture of three Gaussian distributions. Color indicates true generating distribution while shape indicates the assigned $k$-means cluster. (Center and Right) Value of the gap statistic, $P_{\rm mc}$, and the Silhouette index for different numbers of clusters based on the $k$-means clustering partition with Gaussian cluster distributions.
  • Figure 2: Distribution plots for two univariate Gaussian distributions $N(0, 1)$ (solid line) and $N(\mu, 1)$ (dashed line) at decreasing values of $P_{\rm mc}$, where $\pi_1 = \pi_2 = 0.5$. The distance between the two centroids, $|\mu|$, determines the specific $P_{\rm mc}$ value.
  • Figure 3: (Top) The distribution of $p$-values based on $P_{\rm mc}$, Gao et al.'s selective inference procedure, and the two-sample t-test for 5,000 simulation replicates. (Bottom) Power comparison between $P_{\rm mc}$ and Gao et al.'s method to detect the presence of two Gaussian clusters controlling the type I error rate at level $\alpha = 0.05$ as the cluster separability increases. The power at each value $|\mu_1 - \mu_2| / \sigma$ is calculated based on 500 simulation replicates.
  • ...and 6 more figures
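The comparison between the randomized rule $\delta_r$ and the optimal rule $\delta_o$ for two equal-weight univariate Gaussians with common $\sigma$ can be sketched numerically. The code below assumes $\delta_r$ samples the label from the posterior and $\delta_o$ picks the maximum-posterior label; `d` stands for the standardized separation $|\mu_1 - \mu_2| / \sigma$. The function names and the trapezoid-rule integration are illustrative choices, not taken from the paper.

```python
import math


def normal_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def optimal_rule_risk(d):
    """Error of the maximum-posterior rule, Phi(-d / 2), for equal priors."""
    return 0.5 * (1.0 + math.erf(-d / (2.0 * math.sqrt(2.0))))


def randomized_rule_risk(d, n_grid=20001, pad=12.0):
    """Risk of the posterior-sampling rule via the trapezoid rule.

    With equal priors, 2 * tau_1 * tau_2 * p(x) simplifies to
    f1 * f2 / (f1 + f2), where f1 and f2 are the component densities.
    """
    lo, hi = -pad, d + pad
    h = (hi - lo) / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        x = lo + i * h
        f1, f2 = normal_pdf(x, 0.0), normal_pdf(x, d)
        denom = f1 + f2
        # Guard against underflow far from both components.
        integrand = f1 * f2 / denom if denom > 0.0 else 0.0
        weight = 0.5 if i in (0, n_grid - 1) else 1.0
        total += weight * integrand
    return total * h


if __name__ == "__main__":
    for d in (0.0, 0.5, 1.0, 2.0, 4.0):
        print(f"d={d:.1f}  optimal={optimal_rule_risk(d):.4f}  "
              f"randomized={randomized_rule_risk(d):.4f}")
```

Under these assumptions the randomized-rule risk always lies between the optimal-rule risk and twice that value, because $m \le 2m(1-m) \le 2m$ for $m = \min(\tau_1, \tau_2) \le 1/2$. Both risks equal 0.5 when the two distributions coincide.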

Theorems & Definitions (3)

  • Proposition 1
  • proof
  • proof