An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders
Scott C. Lowe, Joakim Bruslund Haurum, Sageev Oore, Thomas B. Moeslund, Graham W. Taylor
TL;DR
This work probes whether embeddings from pretrained encoders, both supervised and self-supervised (SSL), form meaningful clusters on datasets unseen during training, without any retraining. Using a benchmark spanning 26 datasets and a mix of clustering algorithms, it finds that supervised encoders excel near the training domain, while SSL encoders gain a relative advantage on far out-of-distribution (Far-OOD) domains. Fine-tuning SSL encoders often helps within-domain clustering but can hurt Far-OOD performance, with MAE frequently standing out when fine-tuned. The study also shows that manifold-based dimensionality reduction, especially UMAP, improves clustering, and that the silhouette score measured in the reduced space correlates strongly with AMI, offering a practical proxy when no ground-truth labels are available. Overall, the results emphasize that clustering-based evaluation provides orthogonal insights into SSL representations, and they guide practical strategies for zero-shot clustering of unseen data.
Abstract
Can pretrained models generalize to new datasets without any retraining? We deploy pretrained image models on datasets they were not trained for, and investigate whether their embeddings form meaningful clusters. Our suite of benchmarking experiments uses encoders pretrained solely on ImageNet-1k with either supervised or self-supervised learning (SSL) techniques, deployed on image datasets that were not seen during training, and clustered with conventional clustering algorithms. This evaluation provides new insights into the embeddings of self-supervised models, which prioritize different features than supervised models. Supervised encoders typically offer more utility than SSL encoders within the training domain, and vice versa far outside of it; however, fine-tuned encoders demonstrate the opposite trend. Clustering provides a way to evaluate the utility of self-supervised learned representations that is orthogonal to existing methods such as kNN. Additionally, we find that the silhouette score, when measured in a UMAP-reduced space, is highly correlated with clustering performance, and can therefore be used as a proxy for clustering performance on data with no ground truth labels. Our code implementation is available at https://github.com/scottclowe/zs-ssl-clustering/.
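To make the label-free proxy concrete, the following is a minimal sketch of the standard silhouette score, written from its textbook definition in pure Python. It is not the authors' implementation: the function name `silhouette_score`, the toy 2-D points, and the two hand-written label assignments are assumptions for illustration, and the UMAP reduction and clustering steps described above are omitted (the points simply stand in for already-reduced embeddings).

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all samples (Euclidean distance)."""
    # Group sample indices by their assigned cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette score needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Convention: a sample alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the sample's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated groups in a 2-D "reduced" space.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good_labels = [0, 0, 0, 1, 1, 1]  # assignment that follows the two groups
bad_labels = [0, 1, 0, 1, 0, 1]   # assignment that ignores them

good = silhouette_score(points, good_labels)
bad = silhouette_score(points, bad_labels)
print(f"silhouette (good assignment): {good:.3f}")
print(f"silhouette (bad assignment):  {bad:.3f}")
```

No ground-truth labels enter the computation: the score depends only on the points and the cluster assignment being judged, which is what allows it to serve as a proxy on unlabeled data. The assignment that follows the visible group structure scores higher than the one that ignores it.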
