Unsupervised Object Localization in the Era of Self-Supervised ViTs: A Survey
Oriane Siméoni, Éloi Zablocki, Spyros Gidaris, Gilles Puy, Patrick Pérez
TL;DR
This paper surveys unsupervised object localization in the era of self-supervised ViTs, focusing on class-agnostic discovery without manual annotations. It categorizes tasks and metrics, reviews training-free localization methods that exploit patch correlations and graph-based techniques, and discusses approaches that train with pseudo-labels to improve multi-object detection and instance segmentation. It highlights key methods (e.g., LOST, TokenCut, MOVE, UMOD, MOST, FOUND) and the crucial role of self-supervised features (notably DINO) in enabling localization, while also addressing post-processing and fine-tuning strategies that boost performance. The survey concludes with limitations and future directions, including open-vocabulary extensions, multimodal signals, scene-centric data, and learning object-centric representations to advance robust, class-agnostic localization in real-world settings.
Abstract
The recent enthusiasm for open-world vision systems shows the community's strong interest in performing perception tasks outside the closed-vocabulary benchmark setups that have been so popular until now. Being able to discover objects in images or videos without knowing in advance what objects populate the dataset is an exciting prospect. But how can objects be found without knowing anything about them? Recent works show that it is possible to perform class-agnostic unsupervised object localization by exploiting self-supervised pre-trained features. We propose here a survey of unsupervised object localization methods that discover objects in images without requiring any manual annotation in the era of self-supervised ViTs. We gather links to the discussed methods in the repository https://github.com/valeoai/Awesome-Unsupervised-Object-Localization.
