Unsupervised Image Classification with Adaptive Nearest Neighbor Selection and Cluster Ensembles
Melih Baydar, Emre Akbas
TL;DR
ICCE tackles fully unsupervised image classification by leveraging an adaptive nearest-neighbor strategy, multi-head clustering, and cluster ensembles to produce a robust consensus pseudo-labeling and self-training pipeline. By combining a frozen ViT-based encoder with enhanced training objectives (including a cross-entropy term), feature-space improvements, and Sinkhorn-centered head outputs, ICCE attains state-of-the-art results across ten benchmarks and surpasses 70% accuracy on ImageNet without supervision. The key contributions are the adaptive nearest neighbor selection, the cluster-ensembling framework to fuse diverse head outputs, and a self-training stage that solidifies pseudo-label quality for final inference. The approach demonstrates strong practical impact by narrowing the gap to supervised methods and providing a scalable, fully unsupervised pathway for large-scale image understanding.
Abstract
Unsupervised image classification, or image clustering, aims to group unlabeled images into semantically meaningful categories. Early methods integrated representation learning and clustering within an iterative framework. However, the rise of foundational models has recently shifted the focus solely to clustering, bypassing the representation learning step. In this work, we build upon a recent multi-head clustering approach by introducing adaptive nearest neighbor selection and cluster ensembling strategies to improve clustering performance. Our method, "Image Clustering through Cluster Ensembles" (ICCE), begins with a clustering stage, where we train multiple clustering heads on a frozen backbone, producing diverse image clusterings. We then employ a cluster ensembling technique to consolidate these potentially conflicting results into a unified consensus clustering. Finally, we train an image classifier using the consensus clustering result as pseudo-labels. ICCE achieves state-of-the-art performance on ten image classification benchmarks, reaching 99.3% accuracy on CIFAR10, 89% on CIFAR100, and 70.4% on ImageNet, narrowing the performance gap with supervised methods. To the best of our knowledge, ICCE is the first fully unsupervised image classification method to exceed 70% accuracy on ImageNet.
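To make the consolidation step concrete, the following is a minimal sketch of one simple way to fuse conflicting head outputs into consensus pseudo-labels. It is an illustration only: the abstract does not specify which ensembling technique ICCE uses, so this sketch assumes a basic stand-in (align each head's cluster ids to a reference head, then take a per-sample majority vote). The function names `align_to_reference` and `consensus_labels` and the toy data are hypothetical, and the brute-force alignment is practical only for a handful of clusters.

```python
from collections import Counter
from itertools import permutations


def align_to_reference(reference, labels, num_clusters):
    """Relabel `labels` so that it agrees with `reference` as much as possible.

    Cluster ids are arbitrary, so two heads can describe the same partition
    with different ids. This tries every id permutation (small K only).
    """
    best_perm, best_agreement = None, -1
    for perm in permutations(range(num_clusters)):
        agreement = sum(perm[l] == r for l, r in zip(labels, reference))
        if agreement > best_agreement:
            best_perm, best_agreement = perm, agreement
    return [best_perm[l] for l in labels]


def consensus_labels(head_labels, num_clusters):
    """Majority vote across heads after aligning each head to the first one."""
    reference = head_labels[0]
    aligned = [reference] + [
        align_to_reference(reference, labels, num_clusters)
        for labels in head_labels[1:]
    ]
    # Vote per sample; ties go to the label seen first among the heads.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*aligned)]


# Toy example: three heads clustering six images into three clusters.
heads = [
    [0, 0, 1, 1, 2, 2],  # head A
    [1, 1, 2, 2, 0, 0],  # head B: same partition, different cluster ids
    [0, 0, 1, 2, 2, 2],  # head C: disagrees on the fourth image
]
pseudo_labels = consensus_labels(heads, num_clusters=3)
print(pseudo_labels)  # [0, 0, 1, 1, 2, 2]
```

In this toy case the vote overrules head C on the fourth image, and the resulting list plays the role of the pseudo-labels on which a classifier would then be trained.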
