PEEKABOO: Hiding parts of an image for unsupervised object localization
Hasib Zunair, A. Ben Hamza
TL;DR
PEEKABOO addresses unsupervised object localization by masking parts of images to learn context-based pixel- and shape-level representations without labels. It combines a frozen DINO-based encoder with a lightweight segmenter, a Masked Feature Predictor, and a Predictor Consistency Loss, and is trained with a total loss $L_{total}$ that aligns the predictions made on masked and unmasked inputs. Across six benchmarks, PEEKABOO achieves competitive results with far fewer learnable parameters and without test-time training, outperforming many training-free methods and several training-based baselines, and remaining robust on images with small or reflective objects and on cluttered scenes. This approach offers a practical, efficient pathway for real-world localization tasks where labeled data are scarce or unavailable.
Abstract
Localizing objects in an unsupervised manner poses significant challenges due to the absence of key visual information such as the appearance, type and number of objects, as well as the lack of labeled object classes typically available in supervised settings. While recent approaches to unsupervised object localization have demonstrated significant progress by leveraging self-supervised visual representations, they often require computationally intensive training processes, resulting in high resource demands in terms of computation, learnable parameters, and data. They also lack explicit modeling of visual context, potentially limiting their accuracy in object localization. To tackle these challenges, we propose a single-stage learning framework, dubbed PEEKABOO, for unsupervised object localization by learning context-based representations at both the pixel- and shape-level of the localized objects through image masking. The key idea is to selectively hide parts of an image and leverage the remaining image information to infer the location of objects without explicit supervision. The experimental results, both quantitative and qualitative, across various benchmark datasets, demonstrate the simplicity, effectiveness and competitive performance of our approach compared to state-of-the-art methods in both single object discovery and unsupervised salient object detection tasks. Code and pre-trained models are available at: https://github.com/hasibzunair/peekaboo
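To make the key idea of hiding parts of an image concrete, the following is a minimal sketch of random patch masking in plain Python. It is an illustration only, not the authors' implementation: the function name `mask_random_patches`, the square-patch grid, the `hide_ratio` parameter, and the constant `fill` value are all assumptions, since the abstract does not specify how the hidden regions are chosen.

```python
import random


def mask_random_patches(image, patch=4, hide_ratio=0.5, seed=0, fill=0):
    """Hide a random subset of non-overlapping patches in a 2D image.

    `image` is a list of rows of scalar pixel values. Returns a masked copy
    and a binary mask of the same size (1 = hidden pixel, 0 = visible pixel).
    All parameter names and defaults are hypothetical choices for this sketch.
    """
    h, w = len(image), len(image[0])
    rng = random.Random(seed)  # local RNG so the result is reproducible

    # Top-left corner of every patch in the grid covering the image.
    corners = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    n_hide = int(round(hide_ratio * len(corners)))
    hidden = rng.sample(corners, n_hide)

    masked = [row[:] for row in image]  # copy so the input stays untouched
    mask = [[0] * w for _ in range(h)]
    for r0, c0 in hidden:
        for r in range(r0, min(r0 + patch, h)):
            for c in range(c0, min(c0 + patch, w)):
                masked[r][c] = fill
                mask[r][c] = 1
    return masked, mask


if __name__ == "__main__":
    toy_image = [[1] * 8 for _ in range(8)]
    masked, mask = mask_random_patches(toy_image, patch=4, hide_ratio=0.5)
    for row in mask:
        print("".join("#" if v else "." for v in row))
```

In a setup like the one summarized above, both the original image and its masked copy would be passed through the model, and the two predictions would be encouraged to agree; how that agreement is measured is defined by the paper's loss terms and is not reproduced here.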
