DCSI -- An improved measure of cluster separability based on separation and connectedness
Jana Gauss, Fabian Scheipl, Moritz Herrmann
TL;DR
DCSI is a density-aware separability index that combines a robust separation measure based on core points with a connectedness measure derived from minimum spanning trees (MSTs) over those core points. By normalizing the ratio of separation to connectedness and aggregating over class pairs, DCSI evaluates how well a given partition reflects density-based clusters without bias toward convex shapes. Extensive synthetic and real-world experiments show that DCSI correlates strongly with DBSCAN performance, measured by the adjusted Rand index (ARI), on raw data and that it can detect when class labels do not map to meaningful density-based components. This makes it useful both as a separability diagnostic and as a cluster validity index. The work also discusses parameter sensitivity, embedding effects, and practical considerations for high-dimensional data, offering a principled tool for evaluating and guiding density-based clustering in applied contexts.
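The construction summarized above can be illustrated with a minimal sketch. It is not the authors' reference implementation, and several details are assumptions made only for illustration: a single global radius `eps` and neighbor count `min_pts` define core points, separation is the smallest distance between core points of two classes, connectedness is the longest MST edge over a class's core points, the ratio `q` is normalized as `q / (1 + q)`, and pairs are aggregated by a plain mean. All function and parameter names are hypothetical.

```python
import numpy as np
from itertools import combinations

def core_points(X, eps, min_pts):
    """Points of X with at least min_pts other points within distance eps (assumed definition)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n_neighbors = (d <= eps).sum(axis=1) - 1  # subtract 1 to exclude the point itself
    return X[n_neighbors >= min_pts]

def max_mst_edge(P):
    """Longest edge of the Euclidean MST over P, found with Prim's algorithm."""
    n = len(P)
    if n < 2:
        return 0.0
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = d[0].copy()  # cheapest known link from the tree to each vertex
    longest = 0.0
    for _ in range(n - 1):
        candidates = np.where(in_tree, np.inf, best)
        j = int(np.argmin(candidates))
        longest = max(longest, float(candidates[j]))
        in_tree[j] = True
        best = np.minimum(best, d[j])
    return longest

def separation(P, Q):
    """Smallest distance between a point of P and a point of Q."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    return float(d.min())

def dcsi_sketch(X, y, eps, min_pts):
    """Mean over class pairs of q / (1 + q), with q = separation / connectedness.

    Assumes every class has at least two core points; otherwise the
    separation or the ratio is undefined in this simplified version.
    """
    cores = {c: core_points(X[y == c], eps, min_pts) for c in np.unique(y)}
    conn = {c: max_mst_edge(P) for c, P in cores.items()}
    scores = []
    for a, b in combinations(cores, 2):
        q = separation(cores[a], cores[b]) / max(conn[a], conn[b])
        scores.append(q / (1.0 + q))
    return float(np.mean(scores))

# Two rows of four unit-spaced points: far apart, then as close as the within-class spacing.
row = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_far = np.vstack([row, row + [0.0, 10.0]])
X_touching = np.vstack([row, row + [0.0, 1.0]])
print(dcsi_sketch(X_far, y, eps=1.5, min_pts=1))
print(dcsi_sketch(X_touching, y, eps=1.5, min_pts=1))
```

In this toy setup the score approaches 1 when the gap between classes is much larger than the largest gap inside a class, and drops to 0.5 when the two are equal, which is the behavior the normalized ratio is meant to capture.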
Abstract
Whether class labels in a given data set correspond to meaningful clusters is crucial for the evaluation of clustering algorithms using real-world data sets. This property can be quantified by separability measures. The central aspects of separability for density-based clustering are between-class separation and within-class connectedness, and neither classification-based complexity measures nor cluster validity indices (CVIs) adequately incorporate them. A newly developed measure (density cluster separability index, DCSI) aims to quantify these two characteristics and can also be used as a CVI. Extensive experiments on synthetic data indicate that DCSI correlates strongly with the performance of DBSCAN measured via the adjusted Rand index (ARI) but lacks robustness when it comes to multi-class data sets with overlapping classes that are ill-suited for density-based hard clustering. Detailed evaluation on frequently used real-world data sets shows that DCSI can correctly identify touching or overlapping classes that do not correspond to meaningful density-based clusters.
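For readers unfamiliar with the adjusted Rand index used above to score DBSCAN against the class labels, the following is a minimal sketch of the standard pair-counting formula. The function name is illustrative, and treating DBSCAN noise points as simply one more label is an assumption of this sketch rather than something stated here.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the contingency table of two labelings.

    Assumes both sequences have the same length n >= 2.
    """
    n = len(labels_true)
    # Pairs of points that share a label in both labelings.
    sum_cells = sum(comb(v, 2) for v in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a label in each labeling separately.
    sum_rows = sum(comb(v, 2) for v in Counter(labels_true).values())
    sum_cols = sum(comb(v, 2) for v in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial and identical
    return (sum_cells - expected) / (max_index - expected)

# A relabeled but identical partition, then one that cuts across the true classes.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The index is invariant to how clusters are numbered, equals 1 for identical partitions, and is close to 0 (possibly negative) when the agreement is no better than chance.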
