Benchmarking of Clustering Validity Measures Revisited

Connor Simpson, Ricardo J. G. B. Campello, Elizabeth Stojanovski

TL;DR

This paper addresses the lack of a universally reliable internal clustering validity index by benchmarking 26 internal indexes on a new collection of 16177 synthetic datasets paired with eight clustering algorithms. It advances the methodology by introducing three complementary evaluation schemes and by using rank-based correlations with aggregated external rankings to mitigate non-linear biases. The key finding is that no single index dominates across all problems: performance depends strongly on the clustering algorithm and the data properties, and non-linear relationships between internal and external assessments are common. Practically, the work provides guidance on selecting index ensembles tailored to the specific clustering setup and highlights the importance of dataset representativeness when benchmarking internal validity measures.

Abstract

Validation plays a crucial role in the clustering process. Many different internal validity indexes exist for the purpose of determining the best clustering solution(s) from a given collection of candidates, e.g., as produced by different algorithms or different algorithm hyper-parameters. In this study, we present a comprehensive benchmark study of 26 internal validity indexes, which includes highly popular classic indexes as well as more recently developed ones. We adopted an enhanced revision of the methodology presented in Vendramin et al. (2010), developed here to address several shortcomings of this previous work. This overall new approach consists of three complementary custom-tailored evaluation sub-methodologies, each of which has been designed to assess specific aspects of an index's behaviour while preventing potential biases of the other sub-methodologies. Each sub-methodology features two complementary measures of performance, alongside mechanisms that allow for an in-depth investigation of more complex behaviours of the internal validity indexes under study. Additionally, a new collection of 16177 datasets has been produced, paired with eight widely-used clustering algorithms, for a wider applicability scope and representation of more diverse clustering scenarios.

Paper Structure

This paper contains 20 sections, 3 equations, 17 figures, 10 tables.

Figures (17)

  • Figure 1: Example partitions produced by K-Means for a dataset with four ground-truth clusters, where the partition with $k=3$ clusters (left) is a better solution than the partition with $k=4$ clusters (right). The groupings of the partitions are represented by the shape and color of the observations.
  • Figure 2: Scatter plots of the external Jaccard index against two internal indexes, Silhouette (left) and Dunn (right). Each point represents a partition, where the number indicates the number of clusters within that partition, and the color indicates whether the number of clusters is less than (black), equal to (red) or greater than (green) the ground-truth number of clusters. Both indexes select the ground-truth solution (top-right corner) as the best partition; however, when all the other candidate solutions are considered, the Pearson correlation with the Jaccard index is 0.77 for the Silhouette index and only 0.04 for the Dunn index.
  • Figure 3: External index (Jaccard) plotted against an internal index (VRC) for a test dataset. A non-linear relationship can be seen to form two regions in the plot based on the partition containing a larger ($k > k^*$) or smaller ($k < k^*$) number of clusters compared to the ground-truth partition.
  • Figure 4: Ratkowsky-Lance plotted against an external (Jaccard) index for a test dataset. The internal index exhibits a monotonically decreasing behaviour as a function of the number of clusters, which still results in a high Pearson correlation of 0.93.
  • Figure 5: Two partitions of a dataset with 4 ground-truth clusters. Both partitions (A and B) have identical values of the Adjusted Rand Index: ARI = 0.51.
  • ...and 12 more figures
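
The captions above report Pearson correlations between internal and external index values, and the TL;DR mentions rank-based correlations as a way to mitigate non-linear biases. The following is a minimal sketch of that contrast, not the paper's evaluation code. The index values are made up for illustration, and the choice of Spearman's coefficient as the rank-based measure is an assumption.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for eight candidate partitions (illustrative only).
external = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
# A hypothetical internal index that orders the partitions exactly like the
# external index, but on a strongly non-linear scale.
internal = [e ** 6 for e in external]

print("Pearson: ", round(pearson(internal, external), 3))
print("Spearman:", round(spearman(internal, external), 3))
```

In this toy case the internal index ranks every partition exactly as the external index does, so the rank-based coefficient is 1, while the Pearson coefficient is noticeably lower because the relationship is not linear.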