Establishing Validity for Distance Functions and Internal Clustering Validity Indices in Correlation Space

Isabella Degen, Zahraa S Abdallah, Kate Robson Brown, Henry W J Reeve

TL;DR

This work reframes clustering validity by arguing that ICVI performance depends on the underlying structure type rather than the dataset, and introduces a structure-type validity framework built around canonical correlation patterns and a nomological network. Using the CSTS synthetic benchmark, the authors formalize correlation patterns, derive 23 canonical patterns for three variables, and define level sets to capture theoretical similarity among patterns. They systematically evaluate 15 distance functions and four ICVIs, finding that simple Lp-based distances (not correlation-specific) paired with SWC and DBI yield valid measurements for correlation-pattern structure, while VRC and PBM fail under this structure. The study provides thresholds and practical guidance for correlation-based clustering validation (e.g., SWC>0.9, DBI<0.15) and offers a methodological template for establishing validity for other structure types, shifting the focus from dataset-centric rankings to structure-type grounded validity. This approach promotes principled ICVI selection and highlights the necessity of integrating nomological networks and structure-type definitions in clustering validity research.

Abstract

Internal clustering validity indices (ICVIs) assess clustering quality without ground truth labels. Comparative studies consistently find that no single ICVI outperforms others across datasets, leaving practitioners without principled ICVI selection. We argue that inconsistent ICVI performance arises because studies evaluate them based on matching human labels rather than measuring the quality of the discovered structure in the data, using datasets without formally quantifying the structure type and quality. Structure type refers to the mathematical organisation in data that clustering aims to discover. Validity theory requires a theoretical definition of clustering quality, which depends on structure type. We demonstrate this through the first validity assessment of clustering quality measures for correlation patterns, a structure type that arises from clustering time series by correlation relationships. We formalise 23 canonical correlation patterns as the theoretical optimal clustering and use synthetic data modelling this structure with controlled perturbations to evaluate validity across content, criterion, construct, and external validity. Our findings show that Silhouette Width Criterion (SWC) and Davies-Bouldin Index (DBI) are valid for correlation patterns, whilst Calinski-Harabasz (VRC) and Pakhira-Bandyopadhyay-Maulik (PBM) indices fail. Simple Lp norm distances achieve validity, whilst correlation-specific functions fail structural, criterion, and external validity. These results differ from previous studies where VRC and PBM performed well, demonstrating that validity depends on structure type. Our structure-type-specific validation method provides both practical guidance (quality thresholds SWC>0.9, DBI<0.15) and a methodological template for establishing validity for other structure types.
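The reported thresholds can be illustrated with a minimal sketch that computes SWC and DBI over correlation-coefficient vectors using an L1 distance, one of the simple Lp norms the abstract reports as valid. The toy vectors, the function names, and the use of centroids in the DBI are assumptions made for illustration, not the authors' implementation or data.

```python
from statistics import mean

def l1(u, v):
    """L1 (Manhattan) distance between two correlation vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def centroid(points):
    """Component-wise mean of a list of vectors."""
    return [mean(col) for col in zip(*points)]

def silhouette_width(points, labels, dist=l1):
    """Mean silhouette over all points (assumes at least two clusters)."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # singleton cluster: silhouette defined as 0
            scores.append(0.0)
            continue
        a = mean(own)  # mean distance to own cluster
        b = min(       # mean distance to the nearest other cluster
            mean([dist(p, q) for q, l in zip(points, labels) if l == other])
            for other in set(labels) - {labels[i]}
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)

def davies_bouldin(points, labels, dist=l1):
    """Centroid-based Davies-Bouldin index (lower is better)."""
    clusters = {lab: [p for p, l in zip(points, labels) if l == lab]
                for lab in set(labels)}
    cents = {lab: centroid(ps) for lab, ps in clusters.items()}
    scatter = {lab: mean([dist(p, cents[lab]) for p in ps])
               for lab, ps in clusters.items()}
    ratios = [
        max((scatter[i] + scatter[j]) / dist(cents[i], cents[j])
            for j in clusters if j != i)
        for i in clusters
    ]
    return mean(ratios)

# Hypothetical segment correlation vectors (a12, a13, a23): three segments
# close to [0, 0, 1] and three close to [1, 0, 0]. Made-up toy values.
segments = [
    (0.02, 0.01, 0.98), (-0.01, 0.03, 0.97), (0.03, -0.02, 0.99),
    (0.98, 0.01, 0.02), (0.97, -0.02, 0.01), (0.99, 0.02, -0.01),
]
good_labels = [0, 0, 0, 1, 1, 1]
mixed_labels = [0, 1, 0, 1, 0, 1]  # deliberately poor clustering

swc = silhouette_width(segments, good_labels)
dbi = davies_bouldin(segments, good_labels)
swc_mixed = silhouette_width(segments, mixed_labels)

# Thresholds quoted in the abstract: SWC > 0.9 and DBI < 0.15.
print(f"good:  SWC={swc:.3f}, DBI={dbi:.3f}")
print(f"mixed: SWC={swc_mixed:.3f}")
```

With these toy values the well-separated clustering passes both thresholds, while the deliberately mixed labelling falls well below the SWC threshold.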

Paper Structure

This paper contains 53 sections, 43 equations, 10 figures, 34 tables.

Figures (10)

  • Figure 1: Correlation elliptope for three variables from different perspectives. The 23 coloured points show the maximally distinct canonical correlation patterns. Arrows indicate patterns lying on coordinate planes through the origin.
  • Figure 2: Multivariate time series $\mathbf{X}$ is segmented when correlation regimes change. Each segment's ($\mathbf{S}_{m}$) correlation matrix ($\mathbf{A}_{m}$) captures relationships between variate pairs ($Q$). Segments with similar patterns are clustered ($C_{k}$), then mapped to the most similar canonical patterns ($\mathbf{P}_{\ell}$) for validation and interpretation.
  • Figure 3: Example regime change (red line) between segments from canonical pattern 5 $[0,1,-1]$ to 15 $[1,-1,0]$ for the four generation stages: raw = uncorrelated data, normal = canonical correlation patterns, non-normal = distribution-shifted data, downsampled = temporally aggregated data. Empirical correlation coefficients ($a_{12}, a_{13}, a_{23}$) shown above each segment.
  • Figure 4: Descriptive statistics of Jaccard Index, number of segments assigned to wrong clusters, and number of observations shifted for 66 lower quality clusterings and the ground truth across the 30 exploratory subjects.
  • Figure 5: Structural validity performance of four distance functions, the valid $d_{L_{1}}$, $d_{L_{3}}$, and $d_{L_{\text{dot}_{2}}}$ and the invalid $d_{L_{\infty}}$, for tests 1 and 3-5 (columns) on the normal complete and sparse data variants. Dashed lines indicate validity thresholds. Green shading = valid region, grey = invalid. Box plots show median, interquartile range, and outliers for the 30 exploratory subjects.
  • ...and 5 more figures
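
The captions for Figures 2 and 3 describe computing a segment's empirical correlation coefficients ($a_{12}, a_{13}, a_{23}$) and mapping them to the most similar canonical pattern. A minimal sketch of that step follows, assuming Pearson correlation and an L1 distance. Only the two patterns named in Figure 3 are included; the toy segment and all function names are hypothetical, and the paper's actual mapping procedure may differ.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_vector(segment):
    """Pairwise correlations of the variates, ordered (a12, a13, a23)."""
    return [pearson(segment[i], segment[j])
            for i, j in combinations(range(len(segment)), 2)]

def nearest_pattern(vec, patterns):
    """ID of the pattern with the smallest L1 distance to vec."""
    return min(patterns,
               key=lambda pid: sum(abs(a - p)
                                   for a, p in zip(vec, patterns[pid])))

# Subset of canonical patterns: only the two IDs given in Figure 3.
patterns = {5: [0, 1, -1], 15: [1, -1, 0]}

# Hypothetical three-variate segment: x1 and x2 are uncorrelated and
# x3 = x1 - x2, so x3 correlates positively with x1 and negatively with x2.
x1 = [1, -1, 1, -1]
x2 = [1, 1, -1, -1]
x3 = [a - b for a, b in zip(x1, x2)]

vec = correlation_vector([x1, x2, x3])
match = nearest_pattern(vec, patterns)
print(vec, "-> pattern", match)
```

In this toy segment the empirical coefficients are weaker than the idealised $\pm 1$ values, yet the segment is still closest to pattern 5.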