OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network

Tiancheng Zhao, Peng Liu, Kyusong Lee

TL;DR

OmDet introduces a language-aware object-detection framework that continuously accumulates a visual vocabulary by training on numerous OD datasets, enabling open-vocabulary detection and phrase grounding. The core Multimodal Detection Network (MDN) performs deep fusion of vision and language via latent queries, conditioned on task prompts, and is trained with a set-prediction Hungarian loss. Large-scale pre-training across 104 OD datasets (and grounding data in OmDetV2) demonstrates strong open-world performance, state-of-the-art results on ODinW and OV benchmarks, and effective knowledge sharing across tasks. The results indicate scalable benefits from expanding the visual vocabulary, with robust zero/few-shot gains and competitive phrase grounding efficiency, setting a foundation for future growth in vocabulary size and pre-training strategies.
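
The set-prediction Hungarian loss mentioned above relies on a one-to-one matching between predicted and ground-truth objects before any loss is computed. The following is a minimal sketch of that matching step only, not the paper's implementation. The cost terms (negative class probability plus an L1 box distance), the `box_weight` parameter, and all function names are assumptions made for illustration. For brevity it finds the optimal assignment by brute force over permutations; a practical implementation would use a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations


def l1_box_distance(a, b):
    """Sum of absolute coordinate differences between two (cx, cy, w, h) boxes."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pair_cost(pred, target, box_weight=1.0):
    """Hypothetical matching cost: low when the prediction assigns high
    probability to the target label and its box lies close to the target box."""
    class_cost = -pred["probs"][target["label"]]
    return class_cost + box_weight * l1_box_distance(pred["box"], target["box"])


def match_predictions(preds, targets):
    """Return (assignment, total_cost), where assignment[j] is the index of the
    prediction matched to target j. Each prediction is used at most once.
    Brute force: only suitable for tiny illustrative inputs."""
    assert len(preds) >= len(targets), "need at least as many predictions as targets"
    best_cost, best = float("inf"), ()
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(pair_cost(preds[i], targets[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return list(best), best_cost


# Toy example: three predictions over two classes, two ground-truth objects.
preds = [
    {"probs": [0.1, 0.9], "box": (0.5, 0.5, 0.2, 0.2)},
    {"probs": [0.8, 0.2], "box": (0.2, 0.2, 0.1, 0.1)},
    {"probs": [0.5, 0.5], "box": (0.9, 0.9, 0.1, 0.1)},
]
targets = [
    {"label": 0, "box": (0.2, 0.2, 0.1, 0.1)},
    {"label": 1, "box": (0.5, 0.5, 0.2, 0.2)},
]
assignment, total_cost = match_predictions(preds, targets)
print(assignment)  # [1, 0]: target 0 is matched to prediction 1, target 1 to prediction 0
```

In set-prediction detectors of this kind, predictions left unmatched (here, prediction 2) are typically trained toward a "no object" class, and the classification and box losses are computed only on the matched pairs.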

Abstract

The advancement of object detection (OD) in open-vocabulary and open-world scenarios is a critical challenge in computer vision. This work introduces OmDet, a novel language-aware object detection architecture, and an innovative training mechanism that harnesses continual learning and multi-dataset vision-language pre-training. Leveraging natural language as a universal knowledge representation, OmDet accumulates a "visual vocabulary" from diverse datasets, unifying the task as a language-conditioned detection framework. Our multimodal detection network (MDN) overcomes the challenges of multi-dataset joint training and generalizes to numerous training datasets without manual label taxonomy merging. We demonstrate superior performance of OmDet over strong baselines in object detection in the wild, open-vocabulary detection, and phrase grounding, achieving state-of-the-art results. Ablation studies reveal the impact of scaling the pre-training visual vocabulary, indicating a promising direction for further expansion to larger datasets. The effectiveness of our deep fusion approach is underscored by its ability to learn jointly from multiple datasets, enhancing performance through knowledge sharing.

Paper Structure

This paper contains 25 sections, 2 equations, 6 figures, and 12 tables.

Figures (6)

  • Figure 1: Overview of OmDet Architecture. The proposed Multimodal Detection Network iteratively fuses vision and language features into latent queries for object detection.
  • Figure 2: Network architecture for the Multimodal Detection Network (MDN), simplified here for illustration purposes.
  • Figure 3: Comparison with other frameworks. (a) Shallow fusion, which uses text information only for object classification. (b) Deep fusion, which fuses visual and text features in the backbone before they enter the object detection head. (c) Deep latent fusion (ours), which uses latent queries to fuse multimodal information, enabling adaptation to any query-based OD architecture.
  • Figure 4: Inference results on COCO, showing that OmDet-shallow suffers from foreground/background inconsistency (missing objects), while OmDet does not.
  • Figure 5: Vocabulary size used in pre-training vs. the AP score of fine-tuning on ODinW with head-only and prompt tuning. The x-axis is on a log scale.
  • ...and 1 more figures
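
The captions for Figure 1 and Figure 3(c) describe latent queries that iteratively absorb vision and language features. The sketch below illustrates that general idea with plain dot-product cross-attention and a residual update. It is a toy under stated assumptions, not the MDN architecture: there are no learned projections, and the alternation order (vision, then text), the `num_layers` value, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, tokens):
    """Update each query with a softmax-weighted average of the tokens,
    added back to the query as a residual."""
    scale = math.sqrt(len(tokens[0]))
    updated = []
    for q in queries:
        weights = softmax([dot(q, t) / scale for t in tokens])
        context = [sum(w * t[d] for w, t in zip(weights, tokens))
                   for d in range(len(q))]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def deep_latent_fusion(queries, vision_tokens, text_tokens, num_layers=2):
    """Alternate attention over vision and text tokens so the latent queries
    accumulate information from both modalities (assumed ordering)."""
    for _ in range(num_layers):
        queries = cross_attend(queries, vision_tokens)
        queries = cross_attend(queries, text_tokens)
    return queries


# Toy inputs: 2 latent queries, 3 vision tokens, 2 text tokens, feature dim 2.
queries = [[0.0, 0.0], [1.0, -1.0]]
vision_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_tokens = [[0.2, 0.8], [0.9, 0.1]]
fused = deep_latent_fusion(queries, vision_tokens, text_tokens)
print(len(fused), len(fused[0]))  # 2 2: the number and size of queries are unchanged
```

Because the fused result has the same shape as the input queries, it could be passed to any detection head that consumes a fixed set of query vectors, which is the sense in which the Figure 3 caption says the approach adapts to query-based OD architectures.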