Localizing Objects with Self-Supervised Transformers and no Labels

Oriane Siméoni, Gilles Puy, Huy V. Vo, Simon Roburin, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Renaud Marlet, Jean Ponce

TL;DR

LOST addresses unsupervised object localization by exploiting the patch-level keys of a self-supervised vision transformer (DINO). It selects as seed the patch with the fewest positively correlated patches, expands the seed to related patches, and extracts a bounding box, all without any labeled data. The method yields state-of-the-art CorLoc on VOC07/12, enables unsupervised class-agnostic and class-aware detectors trained purely on pseudo-labels, and demonstrates competitive unsupervised detection results across multiple datasets. Because it operates on a single image with linear complexity, LOST offers a scalable approach to object localization that can bootstrap downstream unsupervised detection and categorization.

Abstract

Localizing objects in image collections without supervision can help to avoid expensive annotation campaigns. We propose a simple approach to this problem that leverages the activation features of a vision transformer pre-trained in a self-supervised manner. Our method, LOST, does not require any external object proposal nor any exploration of the image collection; it operates on a single image. Yet, we outperform state-of-the-art object discovery methods by up to 8 CorLoc points on PASCAL VOC 2012. We also show that training a class-agnostic detector on the discovered objects boosts results by another 7 points. Moreover, we show promising results on the unsupervised object discovery task. The code to reproduce our results can be found at https://github.com/valeoai/LOST.
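
The CorLoc figures quoted above can be made concrete with a small sketch. It assumes the usual definition of the metric: one predicted box per image, counted as correct when its intersection-over-union with at least one ground-truth box exceeds 0.5. The function names `iou` and `corloc` and the `(x_min, y_min, x_max, y_max)` box layout are illustrative choices, not taken from the LOST code base.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(pred_boxes, gt_boxes_per_image, thresh=0.5):
    """Percentage of images whose single predicted box matches a ground-truth box.

    pred_boxes: one box per image.
    gt_boxes_per_image: a list of ground-truth boxes for each image.
    The 0.5 threshold is the assumed convention, not a value stated in this text.
    """
    hits = sum(
        any(iou(pred, gt) > thresh for gt in gts)
        for pred, gts in zip(pred_boxes, gt_boxes_per_image)
    )
    return 100.0 * hits / len(pred_boxes)
```

Under this definition, a gain of "8 CorLoc points" means that 8% more images receive a box overlapping a labeled object sufficiently.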

Paper Structure

This paper contains 46 sections, 5 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Three applications of LOST to unsupervised single-object discovery (left), multi-object discovery (middle) and object detection (right). In the latter case, objects discovered by LOST are clustered into categories, and cluster labels are used to train a classical object detector. Although large image collections are used to train the underlying image representation \cite{caron2021emerging} and the detector \cite{shaoqing2015faster}, no annotation is ever used in the pipeline. See \ref{fig:similars} and Tables \ref{tab:training}, \ref{tab:training-class-ap} for more experiments.
  • Figure 2: Initial seed, patch similarities and patch degrees. Top: images from Pascal VOC2007. Middle: initial seed $p^*$ (in red) and patches $q$ similar to $p^*$ (in grey), i.e., such that $\mathbf{f}_{p^*}^{\top} \mathbf{f}_q \geq 0$, hence $a_{p^*q}=1$. Bottom: map of inverse degrees $1/d_p$ of all patches $p$ (yellow to blue, for low to high degrees). The initial seed $p^*$ is the patch with the lowest degree. Figure is best viewed in color.
  • Figure 3: Object localizations on VOC07. The red square represents the seed $p^*$, the yellow box is the box obtained using only the seed $p^*$, and the purple box is the box obtained using all the seeds $\mathcal{S}$.
  • Figure 4: Object localization on VOC07. The red square represents the seed $p^*$, the yellow box is the box obtained using only the seed $p^*$, and the purple box is the box obtained using all the seeds $\mathcal{S}$ with $k=100$.
  • Figure 5: Cases of localization failure on VOC07. The red square represents the seed $p^*$, the yellow box is the box obtained using only the seed $p^*$, and the purple box is the box obtained using all the seeds $\mathcal{S}$ with $k=100$.
  • ...and 4 more figures
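
The seed selection described in the Figure 2 caption can be sketched in a few lines of NumPy. The sketch assumes the patch features $\mathbf{f}_p$ (for LOST, the transformer keys) have already been extracted into an array; the function name `lost_seed_sketch` and its arguments are hypothetical. Only the adjacency rule ($a_{pq}=1$ when $\mathbf{f}_p^{\top}\mathbf{f}_q \geq 0$) and the lowest-degree seed come from the captions. The expansion rule (keep those of the $k$ lowest-degree patches that are similar to the seed) and the box rule (tightest box around patches similar to the seed) are simplifying assumptions made for illustration.

```python
import numpy as np


def lost_seed_sketch(feats, grid_hw, k=100):
    """Toy seed selection on a patch grid.

    feats: (N, d) array of patch features in row-major grid order.
    grid_hw: (H, W) patch-grid shape, with H * W == N.
    Returns the seed index p*, the seed set S, and a box in patch coordinates.
    """
    h, w = grid_hw
    assert feats.shape[0] == h * w

    sim = feats @ feats.T              # pairwise dot products f_p^T f_q
    adj = sim >= 0                     # a_pq = 1 iff f_p^T f_q >= 0
    degree = adj.sum(axis=1)           # d_p = sum_q a_pq
    seed = int(np.argmin(degree))      # p* is the patch with the lowest degree

    # Assumed expansion rule: among the k lowest-degree patches,
    # keep those positively correlated with the seed.
    candidates = np.argsort(degree, kind="stable")[:k]
    seeds = candidates[adj[seed, candidates]]

    # Assumed box rule (simplified): tightest box around patches similar to p*.
    rows, cols = np.divmod(np.flatnonzero(adj[seed]), w)
    box = (int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max()))
    return seed, seeds, box
```

The intuition behind picking the lowest degree is that object patches tend to correlate with fewer patches than background patches do, so the least-connected patch is likely to lie on an object.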