Unsupervised Object Localization in the Era of Self-Supervised ViTs: A Survey

Oriane Siméoni, Éloi Zablocki, Spyros Gidaris, Gilles Puy, Patrick Pérez

TL;DR

This paper surveys unsupervised object localization in the era of self-supervised ViTs, focusing on class-agnostic discovery without manual annotations. It categorizes tasks and metrics, reviews training-free localization methods that exploit patch correlations and graph-based techniques, and discusses approaches that train on pseudo-labels to improve multi-object detection and instance segmentation. It highlights key methods (e.g., LOST, TokenCut, MOVE, UMOD, MOST, FOUND) and the crucial role of self-supervised features (notably DINO) in enabling localization, and it also covers post-processing and fine-tuning strategies that boost performance. The survey concludes with limitations and future directions, including open-vocabulary extensions, multimodal signals, scene-centric data, and learning object-centric representations to advance robust, class-agnostic localization in real-world settings.

Abstract

The recent enthusiasm for open-world vision systems shows the community's strong interest in performing perception tasks outside the closed-vocabulary benchmark setups that have been so popular until now. Being able to discover objects in images/videos without knowing in advance what objects populate the dataset is an exciting prospect. But how can objects be found without knowing anything about them? Recent works show that it is possible to perform class-agnostic unsupervised object localization by exploiting self-supervised pre-trained features. We present here a survey of unsupervised object localization methods that discover objects in images without requiring any manual annotation in the era of self-supervised ViTs. We gather links to the discussed methods in the repository https://github.com/valeoai/Awesome-Unsupervised-Object-Localization.

Paper Structure (47 sections, 7 equations, 11 figures, 10 tables)

Figures (11)

  • Figure 1: Evolution of the number of papers on unsupervised object localization. Histogram of the number of papers mentioning "unsupervised object detection/segmentation/localization" in their title per year, from 2000 to 2023. Data captured by querying dblp.org paper repository.
  • Figure 2: Performance evolution in unsupervised object localization. Evolution of the CorLoc score (more details in \ref{sec:background:def:single-object}) evaluated on the VOC07 dataset over the last three years. Methods in purple include a self-training stage, while those in green exploit only frozen self-supervised features. Gray squares show earlier baselines that perform dataset-level optimization. Results have improved by more than 20 points in two years, with simpler and faster methods that exploit self-supervised features. 'Bb ref.[34]$^{++}$' corresponds to the combination of gomel2023boxbasedrefinement with MOVE bielski2022move and the training of a class-agnostic detector following simeoni2021lost.
  • Figure 3: The different tasks used to evaluate unsupervised object localization methods: (a) unsupervised saliency detection focuses on foreground/background separation, (b) single-object discovery requires accurately localizing at least one object with a box, (c) class-agnostic multi-object detection evaluates whether all objects are detected with accurate boxes, and (d) class-agnostic instance segmentation is its analogue with instance masks.
  • Figure 4: Object localization using DINO's last attention layer. Visualization of different features extracted from the last attention layer of DINO caron2021dino for the original image (c): (a) CLS attention maps generated with all heads; (b) Correlation between a patch of interest (in red) and all other patches, computed from the key features of the last MSA layer; (d-e) Inverse degree matrix and second eigenvector, which are used to extract finer localization information simeoni2021lost, wang2022tokencut (more details in \ref{sec:no-training:degree}, \ref{sec:no-training:clustering}).
  • Figure 5: Feature similarity graph for unsupervised object localization. A similarity graph $\mathcal{G}$ among patches of an image is built and used by unsupervised object localization methods simeoni2021lost, wang2022tokencut, shin2022selfmask, wang2023cutler, simeoni2023found.
  • ...and 6 more figures
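
The captions of Figures 4 and 5 mention a patch similarity graph and an inverse degree map. As a rough illustration only, the sketch below builds a binary similarity graph from toy patch features and picks the patch with the lowest degree as an object seed, in the spirit of LOST. It is not the authors' implementation: the 2-D feature vectors are made up, the function names `similarity_graph`, `degrees` and `lowest_degree_seed` are hypothetical, and the choice of thresholding dot products at 0 is an assumption of this sketch. It also assumes that objects cover fewer patches than the background, so that object patches end up with lower degree.

```python
def dot(u, v):
    """Dot product between two feature vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def similarity_graph(features, threshold=0.0):
    """Binary adjacency matrix: patches i and j are linked when the
    dot product of their features is at least `threshold`.
    (Thresholding at 0 is an assumption of this sketch.)"""
    n = len(features)
    return [
        [1 if dot(features[i], features[j]) >= threshold else 0 for j in range(n)]
        for i in range(n)
    ]


def degrees(adjacency):
    """Degree of each patch: number of patches it is linked to (itself included)."""
    return [sum(row) for row in adjacency]


def lowest_degree_seed(adjacency):
    """Index of the patch with the smallest degree (first one on ties)."""
    deg = degrees(adjacency)
    return min(range(len(deg)), key=deg.__getitem__)


# Toy features (made up): four "background" patches and two "object" patches.
toy_features = [
    [1.0, 0.1], [0.9, -0.1], [1.0, 0.0], [0.8, 0.2],  # background
    [-1.0, 0.2], [-0.9, 0.1],                          # object
]

adjacency = similarity_graph(toy_features)
seed = lowest_degree_seed(adjacency)
# Patches positively correlated with the seed form a coarse object region.
object_patches = [j for j, linked in enumerate(adjacency[seed]) if linked]

print(degrees(adjacency))  # [4, 4, 4, 4, 2, 2]
print(seed)                # 4
print(object_patches)      # [4, 5]
```

In practice the features would come from a self-supervised ViT (for example the key features of DINO's last attention layer, as in Figure 4) rather than hand-written vectors, and the selected patches would be turned into a box or mask by a method-specific post-processing step that this sketch does not cover.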