Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head

Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, Kyusong Lee

TL;DR

This work tackles real-time open-vocabulary object detection by analyzing the bottlenecks in DETR-based open-vocabulary detectors and introducing OmDet-Turbo. The model uses an Efficient Fusion Head (EFH), composed of a language-aware encoder (ELA-Encoder) and decoder (ELA-Decoder), to reduce the cost of multimodal fusion. A decoupled prompt/label encoding scheme and a language-cache mechanism enable fast inference and support multi-task pre-training across OD, grounding, VQA, and HOI tasks. Training uses IoU-aware query selection and a mix of L1, GIoU, and denoising losses. Empirically, OmDet-Turbo-Base achieves state-of-the-art zero-shot results on ODinW and OVDEval, and zero-shot COCO/LVIS performance close to that of state-of-the-art supervised models, while reaching 100.2 FPS with TensorRT and the language cache applied. These results indicate that end-to-end transformer-based open-vocabulary detectors can attain industrially relevant throughput without sacrificing accuracy, making real-time deployment feasible.

Abstract

End-to-end transformer-based detectors (DETRs) have shown exceptional performance in both closed-set and open-vocabulary object detection (OVD) tasks through the integration of language modalities. However, their demanding computational requirements have hindered their practical application in real-time object detection (OD) scenarios. In this paper, we scrutinize the limitations of two leading models in the OVDEval benchmark, OmDet and Grounding-DINO, and introduce OmDet-Turbo. This novel transformer-based real-time OVD model features an innovative Efficient Fusion Head (EFH) module designed to alleviate the bottlenecks observed in OmDet and Grounding-DINO. Notably, OmDet-Turbo-Base achieves 100.2 frames per second (FPS) with TensorRT and language cache techniques applied. In zero-shot scenarios on the COCO and LVIS datasets, OmDet-Turbo achieves performance nearly on par with current state-of-the-art supervised models. Furthermore, it establishes new state-of-the-art results on ODinW and OVDEval, with an AP of 30.1 and an NMS-AP of 26.86, respectively. The practicality of OmDet-Turbo in industrial applications is underscored by its exceptional performance on benchmark datasets and superior inference speed, positioning it as a compelling choice for real-time object detection tasks. Code: https://github.com/om-ai-lab/OmDet

Paper Structure

This paper contains 13 sections, 4 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Module-wise speed comparison between the proposed OmDet-Turbo and the prior state-of-the-art methods OmDet and Grounding-DINO. (Tested on an A100 with a PyTorch implementation.)
  • Figure 2: Model Architecture of OmDet-Turbo.
  • Figure 3: The procedure of our multi-task learning. Initially, the annotated datasets from different tasks are converted into a VQA format. This conversion process involves generating prompts and labels from the original annotations of each task. During training, the converted prompts and labels from various tasks are randomly selected to compose a batch. These prompts and labels are then paired with their respective images. The composed batch, consisting of prompts, labels, and images from different tasks, is fed into the OmDet-Turbo model for training.
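
The batch-composition procedure described in the Figure 3 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Sample` record, the converter functions `od_to_vqa` and `grounding_to_vqa`, the prompt wording, and the uniform sampling strategy are all hypothetical choices made here for illustration. The sketch only shows the general idea of converting annotations from different tasks into a shared prompt/label format and then drawing a mixed batch at random.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    """One training example in a shared VQA-style format (hypothetical schema)."""
    image_id: str      # identifies the image paired with this prompt
    prompt: str        # task description or question
    labels: List[str]  # text labels to be localized in the image
    task: str          # source task, kept only for inspection


def od_to_vqa(image_id: str, categories: List[str]) -> Sample:
    # Assumption: a detection annotation becomes a generic prompt plus
    # the de-duplicated category names as labels.
    return Sample(image_id, "Detect all objects.", sorted(set(categories)), "od")


def grounding_to_vqa(image_id: str, caption: str, phrases: List[str]) -> Sample:
    # Assumption: the caption serves as the prompt and the grounded
    # phrases serve as labels.
    return Sample(image_id, caption, list(phrases), "grounding")


def compose_batch(pools: Dict[str, List[Sample]], batch_size: int,
                  rng: random.Random) -> List[Sample]:
    """Draw a batch uniformly at random from the union of all task pools."""
    all_samples = [s for pool in pools.values() for s in pool]
    return rng.sample(all_samples, k=min(batch_size, len(all_samples)))


# Toy data standing in for converted datasets.
pools = {
    "od": [
        od_to_vqa("img1", ["cat", "dog", "cat"]),
        od_to_vqa("img2", ["car"]),
    ],
    "grounding": [
        grounding_to_vqa("img3", "A man riding a horse.", ["man", "horse"]),
        grounding_to_vqa("img4", "Two birds on a wire.", ["birds", "wire"]),
    ],
}

batch = compose_batch(pools, batch_size=3, rng=random.Random(0))
for sample in batch:
    print(sample.task, sample.image_id, sample.prompt, sample.labels)
```

Because every sample carries its own prompt and label list, a single batch can mix tasks without any task-specific branching downstream. A real pipeline would also load the images and might weight tasks differently when sampling.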