Establishing Validity for Distance Functions and Internal Clustering Validity Indices in Correlation Space
Isabella Degen, Zahraa S Abdallah, Kate Robson Brown, Henry W J Reeve
TL;DR
This work reframes clustering validity by arguing that ICVI performance depends on the type of structure in the data rather than on the particular dataset, and introduces a structure-type validity framework built around canonical correlation patterns and a nomological network. Using the CSTS synthetic benchmark, the authors formalise correlation patterns, derive 23 canonical patterns for three variables, and define level sets to capture theoretical similarity among patterns. They systematically evaluate 15 distance functions and four ICVIs, finding that simple Lp-based distances, rather than correlation-specific ones, paired with SWC and DBI yield valid measurements of correlation-pattern structure, while VRC and PBM fail for this structure type. The study provides thresholds and practical guidance for correlation-based clustering validation (e.g., SWC>0.9, DBI<0.15) and offers a methodological template for establishing validity for other structure types, shifting the focus from dataset-centric rankings to validity grounded in structure type. The approach supports principled ICVI selection and shows why clustering validity research needs explicit structure-type definitions and nomological networks.
Abstract
Internal clustering validity indices (ICVIs) assess clustering quality without ground truth labels. Comparative studies consistently find that no single ICVI outperforms others across datasets, leaving practitioners without a principled basis for ICVI selection. We argue that inconsistent ICVI performance arises because studies evaluate ICVIs on how well they match human labels rather than on how well they measure the quality of the structure discovered in the data, and because they use datasets whose structure type and quality are not formally quantified. Structure type refers to the mathematical organisation in data that clustering aims to discover. Validity theory requires a theoretical definition of clustering quality, which depends on structure type. We demonstrate this through the first validity assessment of clustering quality measures for correlation patterns, a structure type that arises from clustering time series by correlation relationships. We formalise 23 canonical correlation patterns as the theoretically optimal clustering and use synthetic data modelling this structure with controlled perturbations to evaluate content, criterion, construct, and external validity. Our findings show that Silhouette Width Criterion (SWC) and Davies-Bouldin Index (DBI) are valid for correlation patterns, whilst Calinski-Harabasz (VRC) and Pakhira-Bandyopadhyay-Maulik (PBM) indices fail. Simple Lp norm distances achieve validity, whilst correlation-specific functions fail structural, criterion, and external validity. These results differ from previous studies where VRC and PBM performed well, demonstrating that validity depends on structure type. Our structure-type-specific validation method provides both practical guidance (quality thresholds SWC>0.9, DBI<0.15) and a methodological template for establishing validity for other structure types.
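To make the reported thresholds concrete, the following minimal sketch computes SWC and DBI with an Lp distance on a toy clustering. It is an illustration under stated assumptions, not the authors' pipeline: each segment is assumed to be summarised by its three pairwise correlations (r12, r13, r23), the distance is L1 (p=1), and the nine data points and the three idealised patterns they sit near are hand-made for this example. The function names are hypothetical, and degenerate cases (identical centroids, all-identical points) are not handled.

```python
from statistics import mean


def lp_distance(u, v, p=1):
    """Minkowski (Lp) distance between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)


def group_by_label(points, labels):
    """Map each cluster label to the list of points assigned to it."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def silhouette_width(points, labels, p=1):
    """Mean silhouette over all points (higher is better, at most 1)."""
    clusters = group_by_label(points, labels)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention: singletons score 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(lp_distance(point, q, p) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean(lp_distance(point, q, p) for q in members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)


def davies_bouldin(points, labels, p=1):
    """Davies-Bouldin index (lower is better, at least 0)."""
    clusters = group_by_label(points, labels)
    centres = {k: tuple(mean(c) for c in zip(*m)) for k, m in clusters.items()}
    scatter = {
        k: mean(lp_distance(x, centres[k], p) for x in m)
        for k, m in clusters.items()
    }
    # For each cluster, take its worst (largest) similarity to another cluster
    worst = [
        max(
            (scatter[i] + scatter[j]) / lp_distance(centres[i], centres[j], p)
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return mean(worst)


# Hand-made toy data (an assumption for illustration): each tuple holds the
# pairwise correlations (r12, r13, r23) of one segment, lightly perturbed
# around the idealised patterns (1, 1, 1), (0, 0, 0) and (-1, 1, -1).
points = [
    (0.98, 0.97, 0.99), (0.96, 0.99, 0.97), (0.99, 0.98, 0.96),
    (0.02, -0.01, 0.03), (-0.03, 0.02, 0.01), (0.01, 0.03, -0.02),
    (-0.97, 0.98, -0.99), (-0.99, 0.96, -0.97), (-0.98, 0.99, -0.96),
]
good_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # one cluster per pattern
bad_labels = [0, 1, 2, 0, 1, 2, 0, 1, 2]   # patterns mixed across clusters

swc, dbi = silhouette_width(points, good_labels), davies_bouldin(points, good_labels)
swc_bad, dbi_bad = silhouette_width(points, bad_labels), davies_bouldin(points, bad_labels)

print(f"good clustering: SWC={swc:.3f}, DBI={dbi:.3f}")
print(f"bad clustering:  SWC={swc_bad:.3f}, DBI={dbi_bad:.3f}")
```

On this toy data the pattern-aligned labelling satisfies both reported thresholds (SWC>0.9, DBI<0.15), while the mixed labelling scores worse on both indices. Passing a different `p` switches to another Lp norm; which norms behave well on real data is the question the paper's evaluation addresses, and this sketch does not answer it.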
