Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head
Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, Kyusong Lee
TL;DR
This work tackles real-time open-vocabulary object detection by analyzing bottlenecks in DETR-based OV detectors and introducing OmDet-Turbo, which uses an Efficient Fusion Head (EFH) composed of a language-aware encoder (ELA-Encoder) and decoder (ELA-Decoder) to reduce the cost of multimodal fusion. A decoupled prompt/label encoding scheme, combined with a language-cache mechanism, enables fast inference and supports multi-task pre-training across OD, grounding, VQA, and HOI tasks. The training regimen includes IoU-aware query selection and a mix of L1, GIoU, and denoising losses. Empirically, OmDet-Turbo-Base achieves state-of-the-art zero-shot results on ODinW and OVDEval and zero-shot COCO/LVIS performance close to that of supervised models, while reaching 100.2 FPS with TensorRT and the language cache applied. These results indicate that end-to-end transformer-based OV detectors can attain industrially relevant throughput while retaining strong open-vocabulary accuracy, making real-time deployment feasible.
Abstract
End-to-end transformer-based detectors (DETRs) have shown exceptional performance in both closed-set and open-vocabulary object detection (OVD) tasks through the integration of language modalities. However, their demanding computational requirements have hindered their practical application in real-time object detection (OD) scenarios. In this paper, we scrutinize the limitations of two leading models on the OVDEval benchmark, OmDet and Grounding-DINO, and introduce OmDet-Turbo. This novel transformer-based real-time OVD model features an Efficient Fusion Head (EFH) module designed to alleviate the bottlenecks observed in OmDet and Grounding-DINO. OmDet-Turbo-Base achieves 100.2 frames per second (FPS) with TensorRT and language cache techniques applied. Notably, in zero-shot scenarios on the COCO and LVIS datasets, OmDet-Turbo achieves performance nearly on par with current state-of-the-art supervised models. Furthermore, it establishes new state-of-the-art results on ODinW and OVDEval, with an AP of 30.1 and an NMS-AP of 26.86, respectively. The practicality of OmDet-Turbo in industrial applications is underscored by its strong performance on benchmark datasets and its high inference speed, positioning it as a compelling choice for real-time object detection tasks. Code: https://github.com/om-ai-lab/OmDet
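
The language cache mentioned above can be illustrated with a minimal sketch. This is not the OmDet-Turbo implementation; it only assumes that label and prompt texts are encoded independently of the image, so their embeddings can be memoized and reused across frames. The names `LanguageCache` and `toy_encoder` are hypothetical, and the toy encoder stands in for a real text encoder.

```python
from typing import Callable, Dict, List, Sequence

Embedding = List[float]


class LanguageCache:
    """Memoizes text embeddings so repeated labels skip the text encoder.

    Hypothetical helper for illustration; not part of the OmDet codebase.
    """

    def __init__(self, encode_fn: Callable[[str], Embedding]) -> None:
        self._encode_fn = encode_fn
        self._store: Dict[str, Embedding] = {}
        self.misses = 0  # number of times the encoder was actually called

    def get(self, text: str) -> Embedding:
        # Encode only on the first request for a given string.
        if text not in self._store:
            self.misses += 1
            self._store[text] = self._encode_fn(text)
        return self._store[text]

    def get_many(self, texts: Sequence[str]) -> List[Embedding]:
        return [self.get(t) for t in texts]


def toy_encoder(text: str) -> Embedding:
    # Assumption: a stand-in for an expensive text encoder.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


cache = LanguageCache(toy_encoder)
labels = ["person", "dog", "person"]
first = cache.get_many(labels)   # encodes "person" and "dog" once each
second = cache.get_many(labels)  # served entirely from the cache
```

In a video or streaming setting where the label vocabulary stays fixed, a cache of this kind means the text encoder runs once per distinct label rather than once per frame, which is the sort of saving the abstract attributes to the language cache.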
