Utilization of Neighbor Information for Image Classification with Different Levels of Supervision
Gihan Jayatilaka, Abhinav Shrivastava, Matthew Gwilliam
TL;DR
The paper tackles the gap between semi-supervised (generalized category discovery, GCD) and unsupervised image recognition by proposing UNIC, a framework driven by neighbor information that handles both clustering and GCD. It uses a DINO-based ViT backbone to mine positive and negative neighbors and finetunes end-to-end with neighbor-aware losses, adapting to GCD by using ground-truth neighbors for the labeled classes. A second-order neighbor cleaning strategy and a dedicated negative-neighbor mining component enable effective clustering with a single clustering head, achieving state-of-the-art results for clustering (ImageNet-100, ImageNet-200) and GCD (ImageNet-100, CUB-200, SCars, Aircraft). The results indicate that carefully mined neighbor information can serve image classification across different levels of supervision, which matters in practice when only part of the data is labeled.
Abstract
We propose to bridge the gap between semi-supervised and unsupervised image recognition with a flexible method that performs well for both generalized category discovery (GCD) and image clustering. Despite the overlap in motivation between these tasks, the methods themselves are restricted to a single task: GCD methods are reliant on the labeled portion of the data, and deep image clustering methods have no built-in way to leverage the labels efficiently. We connect the two regimes with an innovative approach that Utilizes Neighbor Information for Classification (UNIC) in both the unsupervised (clustering) and semi-supervised (GCD) settings. State-of-the-art clustering methods already rely heavily on nearest neighbors. We improve on their results substantially in two parts: first, with a sampling and cleaning strategy that identifies accurate positive and negative neighbors, and second, by finetuning the backbone with clustering losses computed by sampling both types of neighbors. We then adapt this pipeline to GCD by utilizing the labeled images as ground-truth neighbors. Our method yields state-of-the-art results for both clustering (+3% on ImageNet-100 and ImageNet-200) and GCD (+0.8% on ImageNet-100, +5% on CUB, +2% on SCars, +4% on Aircraft).
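
To make the idea of positive and negative neighbors concrete, the following is a minimal sketch and not the authors' implementation. It assumes that positives are the most cosine-similar embeddings and negatives the least similar ones, and that in the GCD setting a labeled image simply takes the other images with the same label as its positives. The function names `cosine` and `mine_neighbors` and the parameters `k_pos` and `k_neg` are hypothetical, and the paper's cleaning strategy and losses are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_neighbors(embeddings, k_pos, k_neg, labels=None):
    """Return one (positives, negatives) pair of index lists per image.

    embeddings: list of feature vectors (e.g. from a frozen backbone).
    labels: optional list holding a class id for labeled images and
            None for unlabeled ones (assumed encoding, GCD setting only).
    """
    n = len(embeddings)
    result = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Rank all other images from most to least similar to image i.
        ranked = sorted(
            others,
            key=lambda j: cosine(embeddings[i], embeddings[j]),
            reverse=True,
        )
        positives = ranked[:k_pos]
        negatives = ranked[-k_neg:] if k_neg > 0 else []
        # Assumption: a labeled image uses same-label images as its
        # ground-truth positives instead of the mined ones.
        if labels is not None and labels[i] is not None:
            same = [j for j in others if labels[j] == labels[i]]
            if same:
                positives = same
        result.append((positives, negatives))
    return result


# Toy usage: two well-separated groups of 2-D embeddings.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
unsupervised = mine_neighbors(toy, k_pos=1, k_neg=1)
semi_supervised = mine_neighbors(toy, k_pos=1, k_neg=1, labels=[0, 0, None, None])
```

This pure-Python version is quadratic in the number of images and is meant only for reading; a real pipeline would compute similarities in batches on backbone features. If `k_pos + k_neg` exceeds the number of other images, the two lists can overlap, which a practical implementation would need to rule out.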
