Unsupervised Image Classification with Adaptive Nearest Neighbor Selection and Cluster Ensembles

Melih Baydar, Emre Akbas

TL;DR

ICCE tackles fully unsupervised image classification by leveraging an adaptive nearest-neighbor strategy, multi-head clustering, and cluster ensembles to produce a robust consensus pseudo-labeling and self-training pipeline. By combining a frozen ViT-based encoder with enhanced training objectives (including a cross-entropy term), feature-space improvements, and Sinkhorn-centered head outputs, ICCE attains state-of-the-art results across ten benchmarks and surpasses 70% accuracy on ImageNet without supervision. The key contributions are the adaptive nearest neighbor selection, the cluster-ensembling framework to fuse diverse head outputs, and a self-training stage that solidifies pseudo-label quality for final inference. The approach demonstrates strong practical impact by narrowing the gap to supervised methods and providing a scalable, fully unsupervised pathway for large-scale image understanding.

Abstract

Unsupervised image classification, or image clustering, aims to group unlabeled images into semantically meaningful categories. Early methods integrated representation learning and clustering within an iterative framework. However, the rise of foundation models has recently shifted the focus solely to clustering, bypassing the representation learning step. In this work, we build upon a recent multi-head clustering approach by introducing adaptive nearest neighbor selection and cluster ensembling strategies to improve clustering performance. Our method, "Image Clustering through Cluster Ensembles" (ICCE), begins with a clustering stage, where we train multiple clustering heads on a frozen backbone, producing diverse image clusterings. We then employ a cluster ensembling technique to consolidate these potentially conflicting results into a unified consensus clustering. Finally, we train an image classifier using the consensus clustering result as pseudo-labels. ICCE achieves state-of-the-art performance on ten image classification benchmarks, reaching 99.3% accuracy on CIFAR10, 89% on CIFAR100, and 70.4% on ImageNet, narrowing the performance gap with supervised methods. To the best of our knowledge, ICCE is the first fully unsupervised image classification method to exceed 70% accuracy on ImageNet.
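To make the idea of consolidating conflicting clusterings concrete, here is a minimal sketch of one simple consensus scheme: relabel each head's clusters to best match a reference head, then take a per-image majority vote. This is an illustrative stand-in, not the ensembling objective used by ICCE, which is not reproduced here. The function names `align_to_reference` and `consensus_labels` are hypothetical, and the brute-force search over permutations is only practical for a handful of clusters.

```python
from collections import Counter
from itertools import permutations


def align_to_reference(ref, labels, k):
    """Relabel `labels` so it agrees with `ref` on as many samples as possible.

    Cluster ids are arbitrary, so two heads can describe the same partition
    with different ids. Brute force over all k! relabelings (small k only).
    """
    best, best_agree = None, -1
    for perm in permutations(range(k)):
        mapped = [perm[label] for label in labels]
        agree = sum(m == r for m, r in zip(mapped, ref))
        if agree > best_agree:
            best, best_agree = mapped, agree
    return best


def consensus_labels(head_labels, k):
    """Majority vote across heads after aligning every head to the first one."""
    ref = list(head_labels[0])
    aligned = [ref] + [align_to_reference(ref, h, k) for h in head_labels[1:]]
    # zip(*aligned) yields, for each sample, the tuple of votes from all heads.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*aligned)]


# Toy example (assumed data): three heads, six images, three clusters.
heads = [
    [0, 0, 1, 1, 2, 2],  # reference head
    [1, 1, 2, 2, 0, 0],  # same partition, different cluster ids
    [2, 2, 0, 1, 1, 1],  # disagrees with the others on the fourth image
]
pseudo_labels = consensus_labels(heads, k=3)
print(pseudo_labels)
```

In this toy case the third head is outvoted on the fourth image, so the consensus matches the reference partition. The resulting labels play the role of the pseudo-labels used to train the final classifier.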

Paper Structure

This paper contains 27 sections, 7 equations, 4 figures, 16 tables.

Figures (4)

  • Figure 1: Comparison with the state of the art. The previous best accuracies from the literature are shown alongside our results. Our method, ICCE, achieves the best unsupervised image classification performance across ten datasets. The improvements of ICCE over the previous best (in percentage points) are shown on top of the blue bars.
  • Figure 2: Overview of the training pipeline for ICCE. Our method consists of three stages. Stage 1 - Unsupervised classifier training: multiple clustering heads are trained on top of the representations output by a pretrained, frozen encoder. Here $x$ is an unlabeled image and $S_x$ is the adaptively selected nearest neighbor set of $x$. The heads are trained to maximize the probability that $x$ and a randomly selected neighbor $x' \in S_x$ have the same label (Equation \ref{eq:ICCE_loss}). Stage 2 - Cluster ensembling: the different, potentially conflicting clusterings are unified through cluster ensembling (Equation \ref{eq:cluster_ensemble_objective}). Stage 3 - Self-training: the consensus clustering result is used as pseudo-labels to train an image classifier. This classifier uses the same frozen encoder as Stage 1. For inference, only the classifier trained in Stage 3 is used.
  • Figure 3: Nearest Neighbor Accuracy Analysis on Various Datasets with DINOv2 ViT-L/14. Nearest neighbor accuracy remains very high across all datasets when the distance threshold is set to a higher value. Although the NN accuracy decreases with lower distance threshold values, it remains sufficiently high to achieve performance improvements when used in the nearest neighbor selection process. Best viewed when zoomed in.
  • Figure 4: Cluster predictions for images from randomly selected classes in the ImageNet dataset. Each row contains images that have been assigned to the same cluster by ICCE. The ground-truth label is displayed in the text below each image. The first four columns represent correctly classified images, while the last three columns correspond to misclassified ones.
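The adaptively selected neighbor set $S_x$ in Figures 2 and 3 can be illustrated with a small sketch. It assumes the threshold is applied to cosine similarity between encoder features, so that a higher threshold keeps fewer but closer neighbors. This reading is consistent with the Figure 3 caption, but the exact criterion used by ICCE is not given in this text. The function `select_neighbors`, its parameters `threshold` and `max_k`, and the toy feature vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_neighbors(features, threshold, max_k):
    """Return one neighbor list per sample.

    Each sample keeps at most `max_k` of its most similar samples, and only
    those whose similarity reaches `threshold`. The set size therefore adapts
    to the sample: isolated points may end up with no neighbors at all.
    """
    neighbor_sets = []
    for i, f in enumerate(features):
        sims = [(cosine(f, g), j) for j, g in enumerate(features) if j != i]
        sims.sort(reverse=True)  # most similar first
        neighbor_sets.append([j for s, j in sims[:max_k] if s >= threshold])
    return neighbor_sets


# Toy 2-D "features" (assumed data): three nearby points and one outlier.
features = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
loose = select_neighbors(features, threshold=0.95, max_k=2)
strict = select_neighbors(features, threshold=0.98, max_k=2)
print(loose)
print(strict)
```

With the stricter threshold, the first and third points keep only their single closest neighbor, while the outlier has an empty neighbor set under both settings. A fixed-size nearest-neighbor rule would instead force the outlier to pair with dissimilar images.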