DCSI -- An improved measure of cluster separability based on separation and connectedness

Jana Gauss, Fabian Scheipl, Moritz Herrmann

TL;DR

DCSI is a density-aware separability index that combines a robust separation measure based on core points with a connectedness measure derived from minimum spanning trees (MSTs) over core points. By normalizing the ratio of separation to connectedness and aggregating over class pairs, DCSI evaluates how well a given partition reflects density-based clusters without bias toward convex shapes. Extensive synthetic and real-world experiments show that DCSI correlates strongly with DBSCAN performance on raw data, measured via the adjusted Rand index (ARI), and that it can detect when class labels do not map to meaningful density-based components. This makes it useful both as a separability diagnostic and as a cluster validity index. The work also discusses parameter sensitivity, embedding effects, and practical considerations for high-dimensional data, offering a tool for evaluating and guiding density-based clustering in applied contexts.

Abstract

Whether class labels in a given data set correspond to meaningful clusters is crucial for the evaluation of clustering algorithms using real-world data sets. This property can be quantified by separability measures. The central aspects of separability for density-based clustering are between-class separation and within-class connectedness, and neither classification-based complexity measures nor cluster validity indices (CVIs) adequately incorporate them. A newly developed measure (density cluster separability index, DCSI) aims to quantify these two characteristics and can also be used as a CVI. Extensive experiments on synthetic data indicate that DCSI correlates strongly with the performance of DBSCAN measured via the adjusted Rand index (ARI) but lacks robustness when it comes to multi-class data sets with overlapping classes that are ill-suited for density-based hard clustering. Detailed evaluation on frequently used real-world data sets shows that DCSI can correctly identify touching or overlapping classes that do not correspond to meaningful density-based clusters.
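
The ingredients named in the summary (core points, separation between core points, connectedness from an MST over core points, a normalized ratio, aggregation over class pairs) can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made here for illustration: a point counts as a core point if at least `min_pts` other points of its class lie within `eps`; `eps` is the median distance to the `2 * min_pts`-th nearest neighbor within the class; separation is the smallest distance between core points of two classes; connectedness is the longest MST edge; the ratio `q` is normalized as `q / (1 + q)`; and pairwise scores are averaged. All function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def kth_nn_distances(points, k):
    """Distance from each point to its k-th nearest neighbor within the class."""
    out = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(dists[k - 1])
    return out


def core_points(points, eps, min_pts):
    """Assumed rule: at least min_pts other class members within distance eps."""
    return [
        p for i, p in enumerate(points)
        if sum(math.dist(p, q) <= eps
               for j, q in enumerate(points) if j != i) >= min_pts
    ]


def connectedness(core):
    """Longest edge of the MST over the core points (Prim's algorithm)."""
    n = len(core)
    if n < 2:
        return 0.0
    best = [math.dist(core[0], q) for q in core]  # cheapest link to the tree
    in_tree = [False] * n
    in_tree[0] = True
    longest = 0.0
    for _ in range(n - 1):
        nxt = min((i for i in range(n) if not in_tree[i]),
                  key=best.__getitem__)
        longest = max(longest, best[nxt])
        in_tree[nxt] = True
        for i in range(n):
            if not in_tree[i]:
                best[i] = min(best[i], math.dist(core[nxt], core[i]))
    return longest


def separation(core_a, core_b):
    """Smallest distance between core points of two classes."""
    return min(math.dist(p, q) for p in core_a for q in core_b)


def dcsi(classes, min_pts=5):
    """Average normalized separation/connectedness ratio over class pairs.

    Assumes every class has more than 2 * min_pts points and that the
    core points of a class do not all coincide (connectedness > 0).
    """
    cores = []
    for pts in classes:
        eps = median(kth_nn_distances(pts, 2 * min_pts))  # assumed choice
        cores.append(core_points(pts, eps, min_pts))
    conns = [connectedness(c) for c in cores]
    scores = []
    for i, j in combinations(range(len(classes)), 2):
        q = separation(cores[i], cores[j]) / max(conns[i], conns[j])
        scores.append(q / (1 + q))  # assumed normalization to [0, 1)
    return sum(scores) / len(scores)
```

Under these assumptions, two compact classes far apart score close to 1, while two classes whose gap equals their internal MST edge length score 0.5. The brute-force distance computations are quadratic in the class size, so this is only suitable for small examples.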

Paper Structure (48 sections, 30 equations, 16 figures, 6 tables)

Figures (16)

  • Figure 1: Separability from a classification- vs clustering-based view
  • Figure 2: Data of a class with two modes, $n = 500$ (left) and core points and connectedness for different choices of $\varepsilon$ (right). $\varepsilon$ is the $q$-quantile of the distances to the 10th nearest neighbor for $q \in \{0.1, 0.2, 0.3, 0.5, 0.6, 0.8\}$. The obtained core points (with $\textit{MinPts} = 5$) are shown in blue, and the two core points that determine the connectedness are shown in black. This example emphasizes that there are no "true" values of $\mathrm{Sep}$, $\mathrm{Conn}$ and $\mathrm{DCSI}$ and therefore no globally applicable "right" or "optimal" choice of the parameters.
  • Figure 3: Well separated two-class data set, $n_1 = n_2 = 500$ (left) and obtained values of connectedness, separation and DCSI for different $\varepsilon_i$ (right). $\varepsilon_i$ is the $q$-quantile of the distances to the 10th nearest neighbor for $q = 0.1, 0.2, \ldots, 0.9$. For these clearly separated clusters, the dependence of the measures on the specific hyperparameter values is very small.
  • Figure 4: Exemplary data sets to evaluate separability measures
  • Figure 5: Spearman correlation of separability measures and ARI for all 6298 synthetic data sets. See the text ("Overall results") and the caption of Figure \ref{fig:corApp} for more details.
  • ...and 11 more figures

Theorems & Definitions (19)

  • Remark 1.1
  • Definition 3.1: Core points DCSI
  • Definition 3.2: Separation DCSI
  • Definition 3.3: Connectedness DCSI
  • Definition 3.4: DCSI, pairwise
  • Definition 3.5: DCSI, multi-class
  • Definition 3.6: Proposed choice of $\varepsilon_i$
  • Definition D.1: Dunn index
  • Definition D.2: Calinski-Harabasz index
  • Definition D.3: Davies-Bouldin index
  • ...and 9 more