Parameter-Free Clustering via Self-Supervised Consensus Maximization (Extended Version)
Lijun Zhang, Suyuan Liu, Siwei Wang, Shengju Yu, Xueling Zhu, Miaomiao Li, Xinwang Liu
TL;DR
SCMax tackles the longstanding problem of hyperparameter-sensitive clustering with a fully parameter-free framework that unifies hierarchical agglomerative clustering, self-supervised representation learning, and a nearest-neighbor consensus evaluation. At each merge step, a structure-aware representation is refined via a self-supervised task, and the Nearest Neighbor Consensus (NNC) score measures the agreement between the merge decisions made in the original and self-supervised spaces to automatically identify the optimal cluster count $K^*$. The key contributions are (i) parameter-free generation of cluster numbers via nearest-neighbor merging, (ii) a contrastive perturbation mechanism driven by cluster labels, (iii) the NNC metric for automatic structure evaluation without thresholds, and (iv) an analysis showing competitive clustering performance and efficient, scalable computation across datasets. Together, these elements enable robust, scalable clustering without prior knowledge of the number of clusters, with practical impact for open-world data analysis and real-world deployments where hyperparameters are hard to preset.
Abstract
Clustering is a fundamental task in unsupervised learning, but most existing methods heavily rely on hyperparameters such as the number of clusters or other sensitive settings, limiting their applicability in real-world scenarios. To address this long-standing challenge, we propose a novel and fully parameter-free clustering framework via Self-supervised Consensus Maximization, named SCMax. Our framework performs hierarchical agglomerative clustering and cluster evaluation in a single, integrated process. At each step of agglomeration, it creates a new, structure-aware data representation through a self-supervised learning task guided by the current clustering structure. We then introduce a nearest neighbor consensus score, which measures the agreement between the nearest neighbor-based merge decisions suggested by the original representation and the self-supervised one. The moment at which consensus maximization occurs can serve as a criterion for determining the optimal number of clusters. Extensive experiments on multiple datasets demonstrate that the proposed framework outperforms existing clustering approaches designed for scenarios with an unknown number of clusters.
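The consensus idea described above can be illustrated with a minimal sketch. It is not the paper's definition of the score, which is not given in this section. It assumes that each cluster is summarized by a centroid, that "merge decision" means a cluster's nearest other cluster under Euclidean distance, and that consensus is the fraction of clusters whose nearest neighbor agrees in both spaces. The function names `nearest_neighbor` and `consensus_score` are hypothetical.

```python
import math


def nearest_neighbor(points):
    """Return, for each point, the index of its nearest other point.

    Uses Euclidean distance; ties are resolved in favor of the lowest index.
    """
    if len(points) < 2:
        raise ValueError("need at least two clusters to define a nearest neighbor")
    nn = []
    for i, p in enumerate(points):
        best_j, best_d = -1, math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < best_d:
                best_j, best_d = j, d
        nn.append(best_j)
    return nn


def consensus_score(centers_orig, centers_ssl):
    """Fraction of clusters whose nearest-neighbor cluster matches in both spaces.

    Assumption: both lists describe the same clusters in the same order,
    once in the original space and once in the self-supervised space.
    """
    nn_orig = nearest_neighbor(centers_orig)
    nn_ssl = nearest_neighbor(centers_ssl)
    agree = sum(a == b for a, b in zip(nn_orig, nn_ssl))
    return agree / len(nn_orig)


# Toy centroids for four clusters (illustrative values only).
orig = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
ssl = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (6.0, 0.0)]
score = consensus_score(orig, ssl)
```

Under these assumptions, a full pipeline would recompute such a score after each agglomeration step and keep the cluster count at which it peaks.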
