DADO: A Depth-Attention framework for Object Discovery

Federico Gonzalez, Estefania Talavera, Petia Radeva

TL;DR

This work tackles unsupervised object discovery by jointly leveraging depth cues and attention signals. The proposed DADO framework fuses depth maps from a Dense Prediction Transformer with attention maps from DINO ViT, using an adaptive entropy-based weighting scheme to balance the contribution of each cue per image. Depth is organized into adaptive layers via histogram-based segmentation, and each layer is combined with the global attention map to generate candidate object regions, which are filtered with a data-driven threshold and refined into bounding boxes with Soft-NMS. On the VOC07/VOC12 benchmarks, DADO achieves competitive CorLoc and odAP scores, supported by ablations that justify design choices such as using DINOv1 for spatial fidelity and introducing overlap between depth bins. The results suggest that integrating depth-aware segmentation with attention-based localization improves unsupervised object discovery in complex scenes with occlusion and overlapping objects.
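The Soft-NMS refinement step mentioned above can be illustrated with a minimal sketch of the standard Gaussian variant, which decays the scores of overlapping boxes instead of discarding them. The function names, the `sigma` value, and the `score_floor` cutoff are illustrative assumptions; the summary does not state which Soft-NMS variant or parameters DADO uses.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_floor=0.001):
    """Gaussian Soft-NMS (parameters are assumed, not taken from the paper).

    Repeatedly picks the highest-scoring box, then multiplies every remaining
    score by exp(-IoU^2 / sigma). Boxes whose score falls below score_floor
    are dropped. Returns (box, score) pairs in selection order.
    """
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        best_idx = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best_idx)
        kept.append((best_box, best_score))
        decayed = []
        for box, score in remaining:
            overlap = iou(best_box, box)
            new_score = score * math.exp(-(overlap ** 2) / sigma)
            if new_score >= score_floor:
                decayed.append((box, new_score))
        remaining = decayed
    return kept

# Two heavily overlapping candidates and one distant candidate.
candidates = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = soft_nms(candidates, [0.9, 0.8, 0.7])
```

In this example the overlapping second box is kept but demoted below the distant third box, which is the behavior that helps when objects genuinely overlap, as in occluded scenes.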

Abstract

Unsupervised object discovery, the task of identifying and localizing objects in images without human-annotated labels, remains a significant challenge and a growing focus in computer vision. In this work, we introduce a novel model, DADO (Depth-Attention self-supervised technique for Discovering unseen Objects), which combines an attention mechanism and a depth model to identify potential objects in images. To address challenges such as noisy attention maps or complex scenes with varying depth planes, DADO employs dynamic weighting to adaptively emphasize attention or depth features based on the global characteristics of each image. We evaluated DADO on standard benchmarks, where it outperforms state-of-the-art methods in object discovery accuracy and robustness without the need for fine-tuning.
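One way to picture the dynamic weighting is an entropy-based rule: a cue whose values are spread almost uniformly (high entropy) is treated as less informative than a cue concentrated in a few values. The sketch below is a hypothetical illustration under that assumption. The `1 - entropy` confidence rule, the bin count, and the function names are not taken from the paper, whose exact formula is not given here.

```python
import math

def normalized_entropy(values, bins=32):
    """Shannon entropy of a histogram of values in [0, 1], scaled to [0, 1]."""
    counts = [0] * bins
    for v in values:
        idx = min(max(int(v * bins), 0), bins - 1)
        counts[idx] += 1
    total = len(values)
    entropy = 0.0
    for c in counts:
        if c:
            p = c / total
            entropy -= p * math.log(p)
    return entropy / math.log(bins)

def cue_weights(attention, depth, eps=1e-8):
    """Assumed rule: confidence of a cue = 1 - normalized entropy.

    Returns (w_attention, w_depth), which sum to 1. Falls back to equal
    weights when both cues look uninformative.
    """
    conf_att = 1.0 - normalized_entropy(attention)
    conf_dep = 1.0 - normalized_entropy(depth)
    total = conf_att + conf_dep
    if total <= eps:
        return 0.5, 0.5
    return conf_att / total, conf_dep / total

# Flattened toy maps: a sharply peaked attention map and a smooth depth ramp.
attention = [0.0] * 90 + [1.0] * 10
depth = [i / 100 for i in range(100)]
w_att, w_dep = cue_weights(attention, depth)
```

Here the peaked attention map receives most of the weight because the depth ramp is close to uniform. A real implementation would operate on 2D maps from the attention and depth models rather than flat Python lists.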

Paper Structure

This paper contains 15 sections, 9 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Our proposed DADO framework
  • Figure 2: Outputs of DADO. (a) Independent and isolated objects are effectively discovered by both attention mechanisms and depth cues. (b) Objects positioned in front of or behind others can be accurately separated using depth layers; in such cases, attention provides limited additional information. (c) Composite objects, such as the horse and rider, are very difficult to separate when they lie on the same plane; this is the main weakness of our model. (d) The Pascal VOC ground truth does not include the 'goat' class but does contain 'sheep'. DADO finds instances of both objects. (e) Separating two objects that are adjacent and on the same plane is particularly challenging for our model.
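Panels (b), (c), and (e) all hinge on how the depth map is split into layers. The sketch below shows one simple way to do histogram-based layering with overlapping bins: cut the depth range at histogram valleys, then widen each layer by a small margin. The valley rule, the bin count, the `overlap` margin, and the function names are assumptions for illustration, not the paper's procedure.

```python
def depth_layers(depth, bins=16, overlap=0.05):
    """Split depth values in [0, 1] into overlapping layers (assumed scheme).

    Cuts are placed at the center of each histogram valley, i.e. an interior
    bin that is lower than its left neighbor and no higher than its right
    neighbor. Each layer is then widened by `overlap` on both sides.
    """
    counts = [0] * bins
    for d in depth:
        counts[min(max(int(d * bins), 0), bins - 1)] += 1
    cuts = [
        (i + 0.5) / bins
        for i in range(1, bins - 1)
        if counts[i] < counts[i - 1] and counts[i] <= counts[i + 1]
    ]
    edges = [0.0] + cuts + [1.0]
    return [
        (max(0.0, lo - overlap), min(1.0, hi + overlap))
        for lo, hi in zip(edges, edges[1:])
    ]

def layer_mask(depth, layer):
    """Boolean mask of the pixels whose depth falls inside a layer."""
    lo, hi = layer
    return [lo <= d <= hi for d in depth]

# Toy scene: a near object at depth 0.1 and a far object at depth 0.9.
depth = [0.1] * 50 + [0.9] * 50
layers = depth_layers(depth)
near_mask = layer_mask(depth, layers[0])
```

Two objects at clearly different depths end up in different layers, as in panel (b). Objects on the same plane, as in panels (c) and (e), fall into the same layer, so depth alone cannot separate them. Real depth histograms are noisy, so a practical version would likely smooth the histogram before looking for valleys.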