DADO: A Depth-Attention framework for Object Discovery
Federico Gonzalez, Estefania Talavera, Petia Radeva
TL;DR
This work tackles unsupervised object discovery by jointly leveraging depth cues and attention signals. The proposed DADO framework fuses depth maps from a Dense Prediction Transformer with attention maps from a DINO ViT, using an adaptive entropy-based weighting scheme to balance the contribution of each cue per image. Depth is organized into adaptive layers via histogram-based segmentation, and each layer is combined with the global attention map to generate candidate object regions, which are filtered with a data-driven threshold and refined into bounding boxes with Soft-NMS. On the VOC07 and VOC12 benchmarks, DADO achieves competitive CorLoc and odAP scores, supported by ablations that justify design choices such as using DINOv1 for spatial fidelity and introducing overlap between depth bins. The results suggest that integrating depth-aware segmentation with attention-based localization robustly improves unsupervised object discovery in complex scenes with occlusion and overlapping objects.
Abstract
Unsupervised object discovery, the task of identifying and localizing objects in images without human-annotated labels, remains a significant challenge and a growing focus in computer vision. In this work, we introduce a novel model, DADO (Depth-Attention self-supervised technique for Discovering unseen Objects), which combines an attention mechanism and a depth model to identify potential objects in images. To address challenges such as noisy attention maps or complex scenes with varying depth planes, DADO employs dynamic weighting to adaptively emphasize attention or depth features based on the global characteristics of each image. We evaluated DADO on standard benchmarks, where it outperforms state-of-the-art methods in object discovery accuracy and robustness without the need for fine-tuning.
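To make the per-image dynamic weighting more concrete, the following is a minimal sketch of one way an entropy-based weight could balance an attention map against a depth map. It is not the authors' implementation: the function names (`map_entropy`, `normalize`, `fuse`), the histogram bin count, and in particular the rule that the lower-entropy cue receives the larger weight are all assumptions made for illustration.

```python
import numpy as np


def normalize(m):
    """Min-max scale a 2D map to [0, 1]; a constant map becomes all zeros."""
    m = np.asarray(m, dtype=np.float64)
    span = m.max() - m.min()
    return (m - m.min()) / span if span > 0 else np.zeros_like(m)


def map_entropy(m, bins=64):
    """Shannon entropy (bits) of the value histogram of a map in [0, 1]."""
    hist, _ = np.histogram(m, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]  # skip empty bins so log2 is defined
    return float(-(p * np.log2(p)).sum())


def fuse(attention, depth, bins=64):
    """Blend two same-sized maps with a per-image, entropy-derived weight.

    Assumption (hypothetical rule): the cue with the lower histogram
    entropy is treated as more decisive and gets the larger weight.
    Returns the fused map and the weight given to the attention map.
    """
    a, d = normalize(attention), normalize(depth)
    h_a, h_d = map_entropy(a, bins), map_entropy(d, bins)
    total = h_a + h_d
    # Equal weights when both maps are constant (both entropies are zero).
    w_a = 0.5 if total == 0 else h_d / total
    return w_a * a + (1.0 - w_a) * d, w_a


if __name__ == "__main__":
    # Toy inputs: a sharply peaked attention map and a smooth depth ramp.
    attention = np.zeros((32, 32))
    attention[8:16, 8:16] = 1.0
    depth = np.linspace(0.0, 1.0, 32 * 32).reshape(32, 32)
    fused, w_a = fuse(attention, depth)
    print(f"attention weight: {w_a:.2f}, fused shape: {fused.shape}")
```

In this toy case the attention map takes only two values while the depth ramp is spread over all histogram bins, so the attention cue receives the larger weight. A different mapping from entropy to weight (or the opposite direction) would slot into `fuse` without changing the rest of the sketch.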
