PEEKABOO: Hiding parts of an image for unsupervised object localization

Hasib Zunair, A. Ben Hamza

TL;DR

PEEKABOO tackles unsupervised object localization by masking parts of an image and learning context-based representations at the pixel and shape levels without labels. It combines a frozen DINO-based encoder with a lightweight segmenter, a Masked Feature Predictor, and a Predictor Consistency Loss, and optimizes a total loss $L_{total}$ that aligns predictions on masked and unmasked inputs. Across six benchmarks, PEEKABOO achieves competitive results with far fewer learnable parameters and no test-time training, outperforming many training-free methods and several training-based baselines. It remains robust when objects are small or reflective, or when the scene is cluttered. The approach offers a practical, efficient route to localization in settings where labeled data are scarce or unavailable.

Abstract

Localizing objects in an unsupervised manner poses significant challenges due to the absence of key visual information such as the appearance, type and number of objects, as well as the lack of labeled object classes typically available in supervised settings. While recent approaches to unsupervised object localization have demonstrated significant progress by leveraging self-supervised visual representations, they often require computationally intensive training processes, resulting in high resource demands in terms of computation, learnable parameters, and data. They also lack explicit modeling of visual context, potentially limiting their accuracy in object localization. To tackle these challenges, we propose a single-stage learning framework, dubbed PEEKABOO, for unsupervised object localization by learning context-based representations at both the pixel- and shape-level of the localized objects through image masking. The key idea is to selectively hide parts of an image and leverage the remaining image information to infer the location of objects without explicit supervision. The experimental results, both quantitative and qualitative, across various benchmark datasets, demonstrate the simplicity, effectiveness and competitive performance of our approach compared to state-of-the-art methods in both single object discovery and unsupervised salient object detection tasks. Code and pre-trained models are available at: https://github.com/hasibzunair/peekaboo
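
The masking idea can be illustrated with a small sketch. It is not the authors' implementation: the per-pixel sigmoid `toy_segmenter`, the zero-fill used in `hide`, and the binary cross-entropy form of `consistency_loss` are all assumptions made for illustration, since this summary does not give the exact form of the loss. The sketch binarizes a soft mask, hides the selected pixels, and compares predictions on the masked image against binarized predictions on the unmasked image.

```python
import math

def binarize(soft_mask, threshold=0.5):
    """Binary thresholding: 1 marks a hidden pixel, 0 a visible one."""
    return [[1 if v > threshold else 0 for v in row] for row in soft_mask]

def masked_fraction(mask):
    """Fraction of pixels hidden by a binary mask."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total

def hide(image, mask):
    """Zero out every pixel where the mask is 1 (assumed fill value)."""
    return [[0.0 if m else px for px, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]

def toy_segmenter(image, weight=4.0, bias=-2.0):
    """Hypothetical stand-in for the segmenter: a per-pixel sigmoid."""
    return [[1.0 / (1.0 + math.exp(-(weight * px + bias))) for px in row]
            for row in image]

def consistency_loss(pred_masked, pred_clean, eps=1e-7):
    """Mean binary cross-entropy between masked-branch predictions and
    binarized clean-branch predictions (an assumed form of the loss)."""
    total, count = 0.0, 0
    for row_m, row_c in zip(pred_masked, pred_clean):
        for p, q in zip(row_m, row_c):
            target = 1.0 if q > 0.5 else 0.0
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
            count += 1
    return total / count

# Toy grayscale image: a bright 2x3 "object" on a dark background.
image = [
    [0.9, 0.9, 0.9, 0.1],
    [0.9, 0.9, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]
# Toy soft mask; after thresholding it hides half of the pixels.
soft_mask = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.6, 0.2, 0.1],
    [0.0, 0.1, 0.9, 0.8],
    [0.2, 0.3, 0.6, 0.7],
]

mask = binarize(soft_mask)
pred_clean = toy_segmenter(image)
pred_masked = toy_segmenter(hide(image, mask))
loss_clean = consistency_loss(pred_clean, pred_clean)
loss_masked = consistency_loss(pred_masked, pred_clean)
print(f"hidden fraction: {masked_fraction(mask):.2f}")
print(f"loss without masking: {loss_clean:.3f}, with masking: {loss_masked:.3f}")
```

In this toy setup the loss rises once object pixels are hidden, because the stand-in predictor looks at each pixel in isolation. A trainable predictor that can use the surrounding context would be pushed to close that gap, which is the intuition behind learning from masked inputs.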

Paper Structure (13 sections, 1 equation, 6 figures, 2 tables)


Figures (6)

  • Figure 1: Overview of PEEKABOO framework for unsupervised object localization. The proposed learning paradigm consists of an Unsupervised Segmentor, a Masked Feature Predictor and a Predictor Consistency Loss. Here, $f_{\boldsymbol{\theta}}$ is a frozen Self-distillation with No Labels (DINO) encoder (Caron et al., 2021) paired with a lightweight trainable $1\times 1$ convolutional layer decoder having only 770 learnable parameters. The two branches are identical and share weights. After training, $f_{\boldsymbol{\theta}}$ is utilized to generate class-agnostic predicted segmentation masks.
  • Figure 2: Visual comparison of PEEKABOO and state-of-the-art FOUND (Siméoni et al., 2023) on ECSSD, DUT-OMRON and DUTS-TE datasets. Across all datasets, PEEKABOO excels in localizing salient objects, particularly when they are small, reflective, or situated against complex or dimly illuminated backgrounds. Zoom in to observe the results more closely.
  • Figure 3: Ablation study of different modules of PEEKABOO (left) and impact of masking (right) using VOC07, VOC12 and COCO20K datasets. The Masked Feature Predictor (MFP) and Predictor Consistency Loss (PCL) consistently improve performance. Masking a larger fraction of pixels yields better performance.
  • Figure 4: Visualization of failure cases of PEEKABOO on DUT-OMRON for object localization. No refinement step is applied. Zoom in to observe the results more closely.
  • Figure 5: Visualization of masks during training in PEEKABOO. Some masks cover more than 50% of the image. Images are from Irregular Masks Dataset (Liu et al., 2018) after applying binary thresholding.
  • ...and 1 more figure
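
To make the "770 learnable parameters" in the Figure 1 caption concrete, here is a small sketch of how a $1\times 1$ convolution works and how its parameters are counted. The caption does not state the decoder's input or output channel counts, so the function names and the assumption that the layer has a bias term are illustrative only. `shapes_matching` simply lists every channel configuration that is arithmetically consistent with a given total, without claiming which one the authors use.

```python
def conv1x1_param_count(in_channels, out_channels, bias=True):
    """Weights (in * out) plus one optional bias per output channel."""
    return in_channels * out_channels + (out_channels if bias else 0)

def conv1x1(features, weights, biases):
    """Apply a 1x1 convolution.

    features: [C_in][H][W], weights: [C_out][C_in], biases: [C_out].
    Each output pixel is a linear map of that pixel's channel vector,
    so no spatial neighbourhood is involved.
    """
    in_channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [[[biases[o] + sum(weights[o][c] * features[c][y][x]
                              for c in range(in_channels))
              for x in range(width)]
             for y in range(height)]
            for o in range(len(weights))]

def shapes_matching(total_params):
    """All (in_channels, out_channels) pairs, assuming a bias term,
    whose parameter count equals total_params."""
    return [(total_params // out - 1, out)
            for out in range(1, total_params + 1)
            if total_params % out == 0 and total_params // out > 1]

# Two input channels, one output channel, on a 1x2 feature map.
features = [[[1.0, 2.0]], [[3.0, 4.0]]]
output = conv1x1(features, weights=[[1.0, -1.0]], biases=[0.5])
print(output)

# Channel configurations consistent with 770 parameters (with bias).
print(shapes_matching(770))
```

Because a $1\times 1$ convolution has no spatial extent, its parameter count depends only on the channel counts, which is why a decoder of this kind stays so small compared with the frozen encoder.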