Search and Detect: Training-Free Long Tail Object Detection via Web-Image Retrieval

Mankeerat Sidhu, Hetarth Chopra, Ansel Blume, Jeonghwan Kim, Revanth Gangi Reddy, Heng Ji

TL;DR

This paper introduces SearchDet, a training-free long-tail object detection framework that significantly enhances open-vocabulary object detection performance. It also shows that basing object detection on a set of Web-retrieved exemplars is stable with respect to variations in those exemplars.

Abstract

In this paper, we introduce SearchDet, a training-free long-tail object detection framework that significantly enhances open-vocabulary object detection performance. SearchDet retrieves a set of positive and negative images of an object to ground, embeds these images, and computes an input image-weighted query which is used to detect the desired concept in the image. Our proposed method is simple and training-free, yet achieves over 48.7% mAP improvement on ODinW and 59.1% mAP improvement on LVIS compared to state-of-the-art models such as GroundingDINO. We further show that our approach of basing object detection on a set of Web-retrieved exemplars is stable with respect to variations in the exemplars, suggesting a path towards eliminating costly data annotation and training procedures.
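The abstract describes the query computation only at a high level, so the following is a minimal sketch of one way an "input image-weighted query" could be formed, not the authors' implementation. The softmax-style weighting by cosine similarity to the input image, the subtraction of the negative query, the `neg_scale` parameter, and all function names are assumptions made for illustration. Embeddings are plain lists of floats standing in for the outputs of an image encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def weighted_mean(vectors, weights):
    """Weighted average of a list of equal-length vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[i] for w, v in zip(weights, vectors)) / total
        for i in range(dim)
    ]

def image_weighted_query(image_emb, positives, negatives, neg_scale=0.5):
    """Hypothetical query: exemplars that resemble the input image count more.

    Assumption: weights are exp(cosine similarity to the input image), and the
    weighted negative query is subtracted from the weighted positive query.
    """
    pos_weights = [math.exp(cosine(image_emb, p)) for p in positives]
    neg_weights = [math.exp(cosine(image_emb, n)) for n in negatives]
    pos_query = weighted_mean(positives, pos_weights)
    neg_query = weighted_mean(negatives, neg_weights)
    return [p - neg_scale * n for p, n in zip(pos_query, neg_query)]

# Toy 2-D embeddings: two positive exemplars and one negative exemplar.
image_emb = [1.0, 0.0]
positives = [[1.0, 0.0], [0.8, 0.6]]
negatives = [[0.0, 1.0]]
query = image_weighted_query(image_emb, positives, negatives)
```

In this toy setup the resulting query stays close to the positive exemplars and points away from the negative one, which is the behavior the abstract attributes to using both kinds of support images.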

Paper Structure

This paper contains 19 sections, 4 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Detection results for the label "Mountain Dew". While GroundingDINO, one of the state-of-the-art zero-shot object detection methods, fails to detect the Mountain Dew bottles in the image, SearchDet grounds every instance of Mountain Dew that appears. Detection results are also shown for the classes "Dog" and "Aerial Boat".
  • Figure 2: The entire architecture of our method. We compare the adjusted embeddings, produced by the DINOv2 model, of the positive and negative support images, with the relevant masks extracted using the SAM model to provide an initial estimate of our segmentation BBox. We again use DINOv2 for generating pixel-precise heatmaps which provide another estimate for the segmentation. We combine both these estimates using a binarized overlap to get the final segmentation mask.
  • Figure 3: Illustration of our method providing more fine-grained masks after including the negative support images. The negative query (here, waves) helps our method avoid accidentally including irrelevant areas and focus only on areas represented by the positive query (here, surfboard).
  • Figure 4: A comparison of our method's mAP on the ODinW dataset under different concept names. We see a 3.85% increase in the mean mAP just by including the name of the dataset (for example, WildfireSmoke) with the name of the concept in the image.
  • Figure 5: Stability analysis showcasing the cosine similarity of embeddings generated from the positive and negative support images (ten images of each), averaged across all eighty classes in the COCO dataset. The high similarity scores demonstrate the stability of our method, which exhibits consistent patterns in embedding similarities despite the dynamic nature of web-based image retrieval.
  • ...and 1 more figure
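Two steps named in the captions above can be illustrated with a short sketch: the "binarized overlap" of two mask estimates (Figure 2) and the cosine-similarity comparison of exemplar embeddings (Figure 5). The captions do not give the exact procedure, so the threshold of 0.5, the use of a logical AND, the averaging over all cross pairs, and the function names are assumptions for illustration only.

```python
import math

def binarized_overlap(mask_estimate, heatmap, threshold=0.5):
    """Keep a pixel only if the binary mask is 1 AND the heatmap passes the threshold.

    Assumption: both inputs are H x W nested lists; the mask holds 0/1 values
    and the heatmap holds scores in [0, 1].
    """
    return [
        [1 if m == 1 and h >= threshold else 0 for m, h in zip(mask_row, heat_row)]
        for mask_row, heat_row in zip(mask_estimate, heatmap)
    ]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_cross_similarity(run_a, run_b):
    """Average cosine similarity over every pair drawn from two embedding sets."""
    sims = [cosine(a, b) for a in run_a for b in run_b]
    return sum(sims) / len(sims)

# Toy inputs: a 2 x 3 mask estimate and a heatmap of the same shape.
mask = [[1, 1, 0], [0, 1, 1]]
heat = [[0.9, 0.2, 0.8], [0.1, 0.7, 0.6]]
final_mask = binarized_overlap(mask, heat)

# Toy embeddings standing in for two separate retrieval runs of one class.
stability = mean_cross_similarity([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

A high `mean_cross_similarity` between embeddings from separate retrieval runs would correspond to the stability that Figure 5 reports, while `binarized_overlap` keeps only the pixels on which both estimates agree.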