An Open and Comprehensive Pipeline for Unified Object Grounding and Detection

Xiangyu Zhao; Yicheng Chen; Shilin Xu; Xiangtai Li; Xinjiang Wang; Yining Li; Haian Huang

TL;DR

The paper addresses unified grounding across Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC) by introducing MM-Grounding-DINO, an open-source baseline built on Grounding-DINO and MMDetection and pretrained on a broad range of vision datasets. The approach preserves the core Grounding-DINO architecture while adding a bias-initialized contrastive embedding and extensive cross-dataset pretraining, which yields strong zero-shot and fine-tuned performance on the COCO, LVIS, ODinW, and REC benchmarks. Key contributions include an open-source release, comprehensive dataset preparation for the three grounding tasks, and detailed experiments demonstrating robust transferability and improvements over the original baseline. The work offers a practical resource that improves reproducibility and supports further research and applications in open-set grounding and detection.

Abstract

Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks, including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness has led to its widespread adoption as a mainstream architecture for various downstream applications. However, because its training code is unavailable, the original Grounding-DINO model lacks comprehensive public technical details. To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline built with the MMDetection toolbox. It adopts abundant vision datasets for pre-training and various detection and grounding datasets for fine-tuning. We give a comprehensive analysis of each reported result and detailed settings for reproduction. Extensive experiments on these benchmarks demonstrate that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. We release all our code and trained models to the research community at https://github.com/open-mmlab/mmdetection/tree/main/configs/mm_grounding_dino.

Paper Structure (21 sections, 6 figures, 15 tables, 2 algorithms)

Introduction
Approach
Model
Datasets Preparation
Training Settings
Main Results
Zero-shot Transfer
Analysis for GRIT
Validation through Fine-tuning
Fine-tuning on COCO/LVIS
Fine-tuning on REC
Fine-tuning on Downstream Tasks
Conclusion
More Results
Detailed Results on gRefCOCO
...and 6 more sections

Figures (6)

Figure 1: (a) Open-Vocabulary Detection (OVD). (b) Phrase Grounding (PG). (c) Referring Expression Comprehension (REC).
Figure 2: Results on various benchmarks. MM-Grounding-DINO outperforms other grounding models on a broad range of tasks.
Figure 3: Illustration of MM-Grounding-DINO. Given an image and a text description, a text backbone and an image backbone first extract text and image features, respectively. The image and text features are then fed into the feature enhancer module, which performs deep cross-modality fusion. After fusion, a language-guided query selection module extracts cross-modality queries from the image features. These queries are then fed into a cross-modality decoder, which is designed to probe the desired features from the two modalities. The output queries from the final decoder layer are used to predict object boxes and their corresponding phrases.
Figure 4: Visualization of Pre-training Datasets. The first row displays images sourced from the GQA dataset, while the second row displays two images from the GRIT dataset.
Figure 5: Comparison between the ground-truth annotation and the model's prediction. For the 'girl' object, the prediction generated by MM-Grounding-DINO (right) appears to be more precise than the ground-truth annotation (left).
...and 1 more figures
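
The language-guided query selection step named in the Figure 3 caption can be illustrated with a small sketch. It assumes that each image token is scored by its highest dot-product similarity to any text token and that the top-scoring tokens become the decoder queries, which is how the original Grounding-DINO paper describes this step. The function name select_queries and the toy feature vectors are hypothetical and are not taken from the MMDetection code base.

```python
from typing import List, Sequence


def select_queries(
    image_feats: Sequence[Sequence[float]],
    text_feats: Sequence[Sequence[float]],
    num_queries: int,
) -> List[int]:
    """Return indices of the image tokens most similar to the text.

    Hypothetical helper for illustration only (assumed scoring rule:
    max over text tokens of the dot product, then top-k over image tokens).
    """
    scores = []
    for img in image_feats:
        # Similarity of this image token to every text token.
        sims = [sum(a * b for a, b in zip(img, txt)) for txt in text_feats]
        # Keep the best-matching text token as the token's score.
        scores.append(max(sims))
    # Rank image tokens by score, highest first, and keep the top ones.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_queries]


# Toy example: 4 image tokens and 2 text tokens, feature dimension 2.
toy_image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
toy_text_feats = [[1.0, 0.0], [0.2, 0.0]]
selected = select_queries(toy_image_feats, toy_text_feats, num_queries=2)
print(selected)
```

In a real model the selected indices would gather the corresponding fused image features to initialize the decoder queries; the sketch stops at the index selection to stay self-contained.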
