OV-DEIM: Real-time DETR-Style Open-Vocabulary Object Detection with GridSynthetic Augmentation

Leilei Wang, Longfei Liu, Xi Shen, Xuanlong Yu, Ying Tiffany He, Fei Richard Yu, Yingyi Chen

TL;DR

OV-DEIM is an end-to-end DETR-style open-vocabulary detector built on the recent DEIMv2 framework, with integrated vision-language modeling for efficient open-vocabulary inference. The paper also introduces GridSynthetic, a simple yet effective data augmentation strategy that composes multiple training samples into structured image grids.

Abstract

Real-time open-vocabulary object detection (OVOD) is essential for practical deployment in dynamic environments, where models must recognize a large and evolving set of categories under strict latency constraints. Current real-time OVOD methods are predominantly built upon YOLO-style models. In contrast, real-time DETR-based methods still lag behind in terms of inference latency, model lightweightness, and overall performance. In this work, we present OV-DEIM, an end-to-end DETR-style open-vocabulary detector built upon the recent DEIMv2 framework with integrated vision-language modeling for efficient open-vocabulary inference. We further introduce a simple query supplement strategy that improves Fixed AP without compromising inference speed. Beyond architectural improvements, we introduce GridSynthetic, a simple yet effective data augmentation strategy that composes multiple training samples into structured image grids. By exposing the model to richer object co-occurrence patterns and spatial layouts within a single forward pass, GridSynthetic mitigates the negative impact of noisy localization signals on the classification loss and improves semantic discrimination, particularly for rare categories. Extensive experiments demonstrate that OV-DEIM achieves state-of-the-art performance on open-vocabulary detection benchmarks, delivering superior efficiency and notable improvements on challenging rare categories. Code and pretrained models are available at https://github.com/wleilei/OV-DEIM.
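The grid composition idea behind GridSynthetic can be illustrated with a minimal sketch. The abstract only states that multiple training samples are composed into structured image grids; the 2x2 layout, the fixed cell size, the nearest-neighbour resizing, the pixel `xyxy` box format, and the function names `resize_nearest` and `grid_compose` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize via index arithmetic (illustrative choice)."""
    in_h, in_w = img.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return img[rows[:, None], cols[None, :]]


def grid_compose(images, boxes, labels, rows=2, cols=2, cell=(320, 320)):
    """Paste rows*cols samples into one grid image and remap their boxes.

    images: list of (H, W, 3) uint8 arrays
    boxes:  list of (N_i, 4) arrays, xyxy in pixels (assumed format)
    labels: list of length-N_i label sequences
    """
    assert len(images) == len(boxes) == len(labels) == rows * cols
    cell_h, cell_w = cell
    canvas = np.zeros((rows * cell_h, cols * cell_w, 3), dtype=np.uint8)
    out_boxes, out_labels = [], []
    for idx, (img, bxs, lbs) in enumerate(zip(images, boxes, labels)):
        r, c = divmod(idx, cols)  # row-major placement of each sample
        h, w = img.shape[:2]
        y0, x0 = r * cell_h, c * cell_w
        canvas[y0:y0 + cell_h, x0:x0 + cell_w] = resize_nearest(img, cell_h, cell_w)
        # Scale boxes to the cell size, then shift them to the cell origin.
        bxs = np.asarray(bxs, dtype=np.float32).reshape(-1, 4)
        scale = np.array([cell_w / w, cell_h / h] * 2, dtype=np.float32)
        offset = np.array([x0, y0] * 2, dtype=np.float32)
        out_boxes.append(bxs * scale + offset)
        out_labels.extend(lbs)
    return canvas, np.concatenate(out_boxes, axis=0), out_labels
```

With four input samples, a single composed image carries the objects and labels of all four, which is how one forward pass can see more co-occurrence patterns and spatial layouts than a single sample provides.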

Paper Structure

This paper contains 37 sections, 8 equations, 3 figures, 5 tables, 1 algorithm.

Figures (3)

  • Figure 1: Architecture of the OV-DEIM Framework. Given an image $\bm{I}$, the backbone and hybrid encoder extract flattened multi-scale visual features $\bm{F_I}$, which are passed to the vision-text alignment head (CLS) to produce similarity scores and to the bounding box regression head to predict object locations. Given text prompts, the text encoder maps them into textual embeddings $\bm{F_T}$, which guide query selection and vision-text alignment. Through text-aware query selection, the top-ranked queries $Q_{\text{top}}$ are fed into the decoder for iterative refinement, while $Q_{\text{dn}}$ are used for the denoising loss and $Q_{\text{sup}}$ serve as additional queries for computing Fixed AP. Each query comprises both a feature embedding and location information.
  • Figure 2: Effectiveness of GridSynthetic. The EMA-smoothed GIoU loss curves show that GridSynthetic consistently achieves the lowest loss throughout training. This eases localization, which in turn yields more adequate classification supervision and cleaner semantic alignment.
  • Figure 3: Visualizations of Zero-shot Inference on LVIS. We employ the pretrained OV-DEIM-L model and perform inference with the LVIS vocabulary containing 1,203 categories, using only a confidence threshold of 0.5.
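
The text-aware query selection described in the Figure 1 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: scoring each visual token by its maximum cosine similarity to any text embedding, taking $Q_{\text{sup}}$ as the next-ranked tokens after $Q_{\text{top}}$, and the names `select_queries`, `k_top`, and `k_sup` are all hypothetical. The caption does not specify how the supplementary queries are chosen.

```python
import numpy as np


def select_queries(feat_img, feat_txt, k_top=300, k_sup=100, eps=1e-8):
    """Rank visual tokens by their best text similarity (assumed scoring rule).

    feat_img: (N, D) flattened visual features, standing in for F_I
    feat_txt: (C, D) text embeddings, standing in for F_T
    Returns indices for Q_top, indices for Q_sup, and the (N, C) similarity map.
    """
    f_i = feat_img / (np.linalg.norm(feat_img, axis=1, keepdims=True) + eps)
    f_t = feat_txt / (np.linalg.norm(feat_txt, axis=1, keepdims=True) + eps)
    sim = f_i @ f_t.T                      # cosine similarity, shape (N, C)
    score = sim.max(axis=1)                # best-matching category per token
    order = np.argsort(-score, kind="stable")
    top_idx = order[:k_top]                # queries sent to the decoder
    sup_idx = order[k_top:k_top + k_sup]   # assumed: next-ranked tokens
    return top_idx, sup_idx, sim
```

Because the ranking depends on the text embeddings, changing the prompt vocabulary changes which tokens become queries, without retraining the selection step.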