ClustRecNet: A Novel End-to-End Deep Learning Framework for Clustering Algorithm Recommendation
Mohammadreza Bakhtyari, Bogdan Mazoure, Renato Cordeiro de Amorim, Guillaume Rabusseau, Vladimir Makarenkov
TL;DR
Clustering algorithm selection for unlabeled data is challenging due to dataset heterogeneity and limitations of cluster validity indices. The authors propose ClustRecNet, an end-to-end deep learning framework that maps raw dataset representations to clustering algorithm recommendations, trained on a large synthetic repository of 34,000 datasets with ground-truth ARI labels. The model combines a CNN–ResNet–Attention architecture to capture local and global data structure, enabling direct algorithm recommendation without handcrafted meta-features. Empirical evaluations on synthetic and real-world data show consistent ARI gains over traditional CVIs and state-of-the-art AutoML clustering methods, underscoring the approach's practical impact for unsupervised learning tasks.
Abstract
We introduce ClustRecNet - a novel deep learning (DL)-based recommendation framework for determining the most suitable clustering algorithms for a given dataset, addressing the long-standing challenge of clustering algorithm selection in unsupervised learning. To enable supervised learning in this context, we construct a comprehensive data repository comprising 34,000 synthetic datasets with diverse structural properties. Each of them was processed using 10 popular clustering algorithms. The resulting clusterings were assessed via the Adjusted Rand Index (ARI) to establish ground truth labels, used for training and evaluation of our DL model. The proposed network architecture integrates convolutional, residual, and attention mechanisms to capture both local and global structural patterns from the input data. This design supports end-to-end training to learn compact representations of datasets and enables direct recommendation of the most suitable clustering algorithm, reducing reliance on handcrafted meta-features and traditional Cluster Validity Indices (CVIs). Comprehensive experiments across synthetic and real-world benchmarks demonstrate that our DL model consistently outperforms conventional CVIs (e.g. Silhouette, Calinski-Harabasz, Davies-Bouldin, and Dunn) as well as state-of-the-art AutoML clustering recommendation approaches (e.g. ML2DAC, AutoCluster, and AutoML4Clust). Notably, the proposed model achieves a 0.497 ARI improvement over the Calinski-Harabasz index on synthetic data and a 15.3% ARI gain over the best-performing AutoML approach on real-world data.
