Interpretable Clustering with the Distinguishability Criterion
Ali Turfah, Xiaoquan Wen
TL;DR
The paper introduces the Distinguishability criterion $P_{mc}$, a Bayes-risk-based measure of cluster separability that can be used to validate and select cluster configurations across a range of clustering methods. It develops a combined loss framework that integrates $P_{mc}$ with existing criteria, and presents the PHM algorithm, which starts from a fitted finite mixture model and merges components in order of their overlap, producing interpretable dendrograms. The authors connect $P_{mc}$ to internal validity indices, show how it can be used for hypothesis testing with hierarchical clustering, and evaluate the approach on synthetic data and real-world examples (penguin morphology, HGDP genetics, and single-cell RNA-seq). They show that $P_{mc}$ can improve decisions about the number of clusters when clusters overlap, that it can be computed scalably via Monte Carlo estimation, and that the resulting cluster structure aligns with the underlying biology or population structure. The authors suggest that $P_{mc}$ applies broadly to both model-based and heuristic clustering, with potential extensions to other latent-variable models and coalescent-type analyses.
Abstract
Cluster analysis is a popular unsupervised learning tool used in many disciplines to identify heterogeneous sub-populations within a sample. However, validating cluster analysis results and determining the number of clusters in a data set remain outstanding problems. In this work, we present a global criterion called the Distinguishability criterion to quantify the separability of identified clusters and validate inferred cluster configurations. Our computational implementation of the Distinguishability criterion corresponds to the Bayes risk of a randomized classifier under the 0-1 loss. We propose a combined loss function-based computational framework that integrates the Distinguishability criterion with many commonly used clustering procedures, such as hierarchical clustering, k-means, and finite mixture models. We present these new algorithms as well as the results from comprehensive data analysis based on simulation studies and real data applications.
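To make the "Bayes risk of a randomized classifier under the 0-1 loss" concrete, the following is a minimal sketch of a Monte Carlo estimate for a one-dimensional Gaussian mixture. It assumes the randomized classifier draws each label from the posterior membership probabilities $\pi_k(x)$, so that the misclassification probability at $x$ is $1 - \sum_k \pi_k(x)^2$. That reading, the choice of a Gaussian mixture with known parameters, and the names `gaussian_pdf` and `estimate_pmc` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def gaussian_pdf(x, mean, sd):
    """Density of a univariate normal distribution, evaluated elementwise."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def estimate_pmc(weights, means, sds, n_draws=100_000, seed=0):
    """Monte Carlo estimate of the misclassification probability of a
    classifier that samples its label from the posterior (assumed reading).

    weights, means, sds: parameters of a 1-D Gaussian mixture (hypothetical
    inputs; weights must sum to 1).
    """
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    sds = np.asarray(sds, dtype=float)

    # Draw observations from the mixture: a component label, then a value.
    labels = rng.choice(len(weights), size=n_draws, p=weights)
    x = rng.normal(means[labels], sds[labels])

    # Posterior membership probabilities, shape (n_draws, n_components).
    dens = weights * gaussian_pdf(x[:, None], means, sds)
    post = dens / dens.sum(axis=1, keepdims=True)

    # If both the true label and the assigned label follow the posterior,
    # they agree with probability sum_k post_k^2 at each x.
    return float(np.mean(1.0 - np.sum(post ** 2, axis=1)))


if __name__ == "__main__":
    for gap in (0.0, 1.0, 2.0, 4.0, 10.0):
        pmc = estimate_pmc([0.5, 0.5], [0.0, gap], [1.0, 1.0])
        print(f"gap={gap:4.1f}  estimated P_mc={pmc:.4f}")
```

Under these assumptions, two identical, equally weighted components give an estimate of 0.5, since every posterior is (0.5, 0.5), and the estimate shrinks toward 0 as the component means move apart. Parameters estimated by a fitted mixture model could be passed in place of the known values used here.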
