An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders

Scott C. Lowe, Joakim Bruslund Haurum, Sageev Oore, Thomas B. Moeslund, Graham W. Taylor

TL;DR

This work probes whether embeddings from pretrained encoders, both supervised and self-supervised (SSL), form meaningful clusters on datasets unseen during training, without any retraining. Using a benchmark across 26 datasets and a mix of clustering algorithms, it finds that supervised encoders excel near the training domain, while SSL encoders gain a relative advantage on far out-of-domain (Far-OOD) datasets; fine-tuning SSL encoders often helps within-domain clustering but can hurt Far-OOD performance, with MAE frequently standing out when fine-tuned. The study also shows that manifold-based dimensionality reduction, especially UMAP, improves clustering, and that the silhouette score in a reduced space correlates strongly with AMI, offering a practical proxy when no ground-truth labels are available. Overall, the results indicate that clustering-based evaluation provides insights into SSL representations that are orthogonal to existing evaluations, and they suggest practical strategies for zero-shot clustering of unseen data.

Abstract

Can pretrained models generalize to new datasets without any retraining? We deploy pretrained image models on datasets they were not trained for, and investigate whether their embeddings form meaningful clusters. Our suite of benchmarking experiments uses encoders pretrained solely on ImageNet-1k with either supervised or self-supervised training techniques, deployed on image datasets that were not seen during training, and clustered with conventional clustering algorithms. This evaluation provides new insights into the embeddings of self-supervised models, which prioritize different features than supervised models. Supervised encoders typically offer more utility than SSL encoders within the training domain, and vice versa far outside of it; however, fine-tuned encoders demonstrate the opposite trend. Clustering provides a way to evaluate the utility of self-supervised learned representations that is orthogonal to existing methods such as kNN. Additionally, we find that the silhouette score, when measured in a UMAP-reduced space, is highly correlated with clustering performance, and can therefore be used as a proxy for clustering performance on data with no ground-truth labels. Our code implementation is available at https://github.com/scottclowe/zs-ssl-clustering/.
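
To make the label-free proxy mentioned in the abstract concrete, the following is a minimal sketch of how a silhouette score can be computed from embeddings and cluster assignments alone. It is not the authors' implementation. It assumes NumPy and Euclidean distance, uses toy Gaussian data in place of real encoder embeddings, and omits the UMAP reduction step that the abstract applies before measuring the score. The function name `silhouette_score` and all variable names are illustrative.

```python
import numpy as np


def silhouette_score(X, labels):
    """Mean silhouette coefficient, using Euclidean distance.

    For each sample, a = mean distance to the other members of its own
    cluster and b = mean distance to the members of the nearest other
    cluster; its silhouette is (b - a) / max(a, b).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least 2 clusters")

    # Pairwise Euclidean distances, shape (n_samples, n_samples).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # Singleton clusters score 0 by convention.
        # d(i, i) = 0, so dividing by (n_same - 1) excludes the sample itself.
        a = dists[i, same].sum() / (n_same - 1)
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())


# Toy stand-in for encoder embeddings (an assumption, not real data):
# two well-separated Gaussian blobs in 8 dimensions.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 8)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 8)),
])
good_labels = np.repeat([0, 1], 50)           # follows the blob structure
random_labels = rng.integers(0, 2, size=100)  # ignores the structure

score_good = silhouette_score(embeddings, good_labels)
score_random = silhouette_score(embeddings, random_labels)
print(f"structured assignment: {score_good:.3f}")
print(f"random assignment:     {score_random:.3f}")
```

No ground-truth labels enter the computation: the score depends only on the embeddings and the assignments being evaluated, which is what allows it to serve as a proxy on unlabelled data. An assignment that follows the structure of the data scores close to 1, while an arbitrary one scores close to 0.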

Paper Structure (43 sections, 3 equations, 10 figures, 26 tables)


Figures (10)

  • Figure 1: Percentage-point (p.p.) difference in AMI between clusters formed from SSL encoder embeddings versus supervised encoder embeddings. We compare the quality of clustering of each dataset (mean AMI over 6 clusterers) using SSL encoder embeddings against that of encoders trained with cross-entropy on IN-1k. We present the mean across datasets in each group (error bars: ±1 stderr; $3\!\le\!N\!\le\!8$ datasets).
  • Figure 2: AMI scores across taxonomic levels. For the iNaturalist-21 dataset, we measure the AMI score at each of the 7 taxonomic levels. For the BIOSCAN-1M dataset, we measure it from order to species level, and also when using the Barcode Index Number (BIN) as a proxy for subspecies labels. The scores are reported for each encoder, averaged over the tested clustering methods.
  • Figure 3: Percentage-point (p.p.) difference in AMI between clusters formed from embeddings of SSL-pretrained networks fine-tuned on IN-1k versus fully-supervised networks. We measure the difference in AMI (mean over 6 clusterers) with fine-tuned SSL encoders as compared to encoders trained with cross-entropy on IN-1k (error bars: ±1 stderr; $3\!\le\!N\!\le\!8$ datasets). Note: The x-scale differs from that used in Figure 1, but the baseline (0 values) is the same.
  • Figure 4: Ranked AMI--Silhouette scatter plots. The ranked AMI and silhouette score ($S$) per clusterer, across datasets and encoders (higher is better). The silhouette scores are measured in the original (top) and UMAP-reduced 50-d (bottom) feature spaces. We indicate the per-clustering-method Spearman's rank correlation ($\rho$).
  • Figure 5: Percentage of variance explained by PCA-reduced embeddings. We show the fraction of the total variance of the data which is explained by the first $N$ PCA dimensions. The number of dimensions included is represented both in absolute terms (upper x-axes) and relative to the number of dimensions of the original embeddings (lower x-axes).
  • ...and 5 more figures
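
The quantity plotted in Figure 5, the fraction of total variance explained by the first $N$ PCA dimensions, can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes NumPy, computes PCA via a singular value decomposition of the centered data, and uses synthetic 16-d data in place of real encoder embeddings. The function name `cumulative_explained_variance` is hypothetical.

```python
import numpy as np


def cumulative_explained_variance(X):
    """Fraction of total variance explained by the first N principal components.

    Entry N - 1 of the returned array is the fraction for the first N components.
    """
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # The squared singular values of the centered data are proportional
    # to the variance captured by each principal component.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values**2
    return np.cumsum(variances) / variances.sum()


rng = np.random.default_rng(0)
# Synthetic "embeddings" (an assumption, not real data): 200 samples in
# 16 dimensions, with almost all variance lying in 3 latent directions.
latent = rng.normal(size=(200, 3)) * np.array([10.0, 5.0, 2.0])
mixing = rng.normal(size=(3, 16))
embeddings = latent @ mixing + 0.01 * rng.normal(size=(200, 16))

ratios = cumulative_explained_variance(embeddings)
n_dims = embeddings.shape[1]
for n in (1, 3, n_dims):
    # Report N both in absolute terms and relative to the original dimensionality.
    print(f"first {n:2d} dims ({n / n_dims:.0%} of original): {ratios[n - 1]:.4f}")
```

Reporting the number of retained dimensions both in absolute terms and as a fraction of the original dimensionality mirrors the two x-axes described in the caption, which makes encoders with different embedding sizes comparable.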