A High-Performance External Validity Index for Clustering with a Large Number of Clusters
Mohammad Yasin Karbasian, Ramin Javadi
TL;DR
This work tackles the computational bottleneck of external clustering validation, which is most severe when the number of clusters is large. It introduces Stable Matching Based Pairing (SMBP), which applies stable matching to a contingency matrix to pair clusters across two clustering results, with a theoretical runtime of $O(N^2)$ versus $O(N^3)$ for traditional Maximum Weighted Matching (MWM). Empirically, SMBP delivers accuracy comparable to MWM and the Maximum Match Measure (MMM) while substantially reducing runtime, especially on large-scale datasets with many clusters, and it can be implemented in PyTorch and TensorFlow. The approach is validated on real and synthetic data, scales well, and supports comparing clusterings with different numbers of clusters, making it a practical tool for big-data clustering evaluation.
Abstract
This paper introduces the Stable Matching Based Pairing (SMBP) algorithm, a high-performance external validity index for clustering evaluation in large-scale datasets with a large number of clusters. SMBP leverages the stable matching framework to pair clusters across different clustering methods, significantly reducing computational complexity to $O(N^2)$, compared to traditional Maximum Weighted Matching (MWM) with $O(N^3)$ complexity. Through comprehensive evaluations on real-world and synthetic datasets, SMBP demonstrates comparable accuracy to MWM and superior computational efficiency. It is particularly effective for balanced, unbalanced, and large-scale datasets with a large number of clusters, making it a scalable and practical solution for modern clustering tasks. Additionally, SMBP is easily implementable within machine learning frameworks like PyTorch and TensorFlow, offering a robust tool for big data applications. The algorithm is validated through extensive experiments, showcasing its potential as a powerful alternative to existing methods such as Maximum Match Measure (MMM) and Centroid Ratio (CR).
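To make the pairing idea concrete, the following is a minimal Python sketch of stable matching on a contingency matrix. It is an illustration under stated assumptions, not the authors' implementation: the function names (`contingency_matrix`, `stable_pairing`, `pairing_score`), the use of overlap counts as preferences for both sides, and the score defined as the matched overlap divided by the number of points are all hypothetical choices made for this example. The preference sort used here also adds a logarithmic factor that the paper's $O(N^2)$ bound does not include.

```python
import numpy as np


def contingency_matrix(labels_a, labels_b):
    """Entry (i, j) counts the points in cluster i of A and cluster j of B."""
    _, a_idx = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)
    _, b_idx = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)
    m = np.zeros((a_idx.max() + 1, b_idx.max() + 1), dtype=np.int64)
    np.add.at(m, (a_idx, b_idx), 1)  # accumulate counts, repeated pairs included
    return m


def stable_pairing(m):
    """Gale-Shapley pairing: rows propose to columns in decreasing overlap order.

    Assumption for this sketch: both sides rank partners by overlap count.
    Rows that exhaust every column stay unmatched, so the two clusterings
    may have different numbers of clusters.
    """
    m = np.asarray(m)
    n_rows, n_cols = m.shape
    prefs = np.argsort(-m, axis=1, kind="stable")  # per row: columns, best first
    next_choice = [0] * n_rows   # index of the next column each row will try
    holder = [-1] * n_cols       # row currently held by each column (-1 = none)
    free = list(range(n_rows))
    while free:
        r = free.pop()
        if next_choice[r] == n_cols:
            continue  # row r has proposed to every column; leave it unmatched
        c = int(prefs[r, next_choice[r]])
        next_choice[r] += 1
        current = holder[c]
        if current == -1:
            holder[c] = r
        elif m[r, c] > m[current, c]:
            holder[c] = r          # column c prefers the new proposer
            free.append(current)
        else:
            free.append(r)         # rejected; r will try its next column
    return sorted((holder[c], c) for c in range(n_cols) if holder[c] != -1)


def pairing_score(m):
    """Hypothetical index: fraction of points covered by the matched pairs."""
    m = np.asarray(m)
    matched = sum(int(m[r, c]) for r, c in stable_pairing(m))
    return matched / int(m.sum())
```

For example, with `labels_a = [0, 0, 0, 1, 1, 2, 2, 2]` and `labels_b = [1, 1, 0, 0, 0, 2, 2, 2]`, `stable_pairing(contingency_matrix(labels_a, labels_b))` pairs cluster 0 of A with cluster 1 of B, cluster 1 with cluster 0, and cluster 2 with cluster 2. Each row proposes to each column at most once, so the matching loop performs at most as many proposals as the contingency matrix has entries.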
