OV-DEIM: Real-time DETR-Style Open-Vocabulary Object Detection with GridSynthetic Augmentation

Leilei Wang; Longfei Liu; Xi Shen; Xuanlong Yu; Ying Tiffany He; Fei Richard Yu; Yingyi Chen

TL;DR

OV-DEIM is an end-to-end DETR-style open-vocabulary detector built on the recent DEIMv2 framework, with integrated vision-language modeling for efficient open-vocabulary inference. The paper also introduces GridSynthetic, a simple yet effective data augmentation strategy that composes multiple training samples into structured image grids.

Abstract

Real-time open-vocabulary object detection (OVOD) is essential for practical deployment in dynamic environments, where models must recognize a large and evolving set of categories under strict latency constraints. Current real-time OVOD methods are predominantly built upon YOLO-style models. In contrast, real-time DETR-based methods still lag behind in terms of inference latency, model lightweightness, and overall performance. In this work, we present OV-DEIM, an end-to-end DETR-style open-vocabulary detector built upon the recent DEIMv2 framework with integrated vision-language modeling for efficient open-vocabulary inference. We further introduce a simple query supplement strategy that improves Fixed AP without compromising inference speed. Beyond architectural improvements, we introduce GridSynthetic, a simple yet effective data augmentation strategy that composes multiple training samples into structured image grids. By exposing the model to richer object co-occurrence patterns and spatial layouts within a single forward pass, GridSynthetic mitigates the negative impact of noisy localization signals on the classification loss and improves semantic discrimination, particularly for rare categories. Extensive experiments demonstrate that OV-DEIM achieves state-of-the-art performance on open-vocabulary detection benchmarks, delivering superior efficiency and notable improvements on challenging rare categories. Code and pretrained models are available at https://github.com/wleilei/OV-DEIM.
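The abstract describes GridSynthetic only as composing multiple training samples into structured image grids. As a rough illustration of that idea, and not the authors' implementation, the sketch below remaps the box annotations of `rows * cols` samples onto one grid canvas. The `Sample` container, the `grid_compose` helper, the fixed cell size, and the stretch-to-cell scaling are all assumptions made for this example; pixel resizing and pasting are omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Sample:
    """Hypothetical container for one annotated image (pixels omitted)."""
    width: int
    height: int
    boxes: List[Box]
    labels: List[str]


def grid_compose(samples: List[Sample], rows: int, cols: int,
                 cell_w: int, cell_h: int) -> Sample:
    """Place rows * cols samples on one canvas, row-major, one per cell.

    Assumption: each image is stretched to exactly fill its cell, so the
    boxes are scaled by (cell_w / width, cell_h / height) and then shifted
    by the cell's top-left corner.
    """
    if len(samples) != rows * cols:
        raise ValueError("need exactly rows * cols samples")
    boxes: List[Box] = []
    labels: List[str] = []
    for idx, s in enumerate(samples):
        r, c = divmod(idx, cols)                 # grid position of this cell
        sx, sy = cell_w / s.width, cell_h / s.height
        ox, oy = c * cell_w, r * cell_h          # top-left corner of the cell
        for (x1, y1, x2, y2), label in zip(s.boxes, s.labels):
            boxes.append((x1 * sx + ox, y1 * sy + oy,
                          x2 * sx + ox, y2 * sy + oy))
            labels.append(label)
    return Sample(cols * cell_w, rows * cell_h, boxes, labels)
```

Under these assumptions, a 2x2 grid of 200-pixel cells yields a single 400x400 training sample whose label set is the union of the four source images, which is one way a single forward pass could see more object co-occurrences.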

Paper Structure (37 sections, 8 equations, 3 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Open-Vocabulary Object Detection (OVOD)
Data Augmentations in Object Detection
Background
DETR-Style Detectors
Architecture
Training Process
Inference Process
Open-Vocabulary Object Detection
Method
Architecture of the OV-DEIM Framework
Training Objective
Inference
Text Encoder
...and 22 more sections

Figures (3)

Figure 1: Architecture of the OV-DEIM Framework. Given an image $\bm{I}$, the backbone and hybrid encoder extract flattened multi-scale visual features $\bm{F_I}$, which are fed into the vision-text alignment head (CLS) to produce similarity scores and into the bounding box regression head to predict object locations. Given text prompts, the text encoder maps them into textual embeddings $\bm{F_T}$, which are used to guide query selection and vision-text alignment. Through the text-aware query selection, the top-ranked queries $Q_{\text{top}}$ are fed into the decoder for iterative refinement, while $Q_{\text{dn}}$ are used for the denoising loss and $Q_{\text{sup}}$ serve as additional queries for computing Fixed AP. Each query comprises both a feature embedding and location information.
Figure 2: Effectiveness of GridSynthetic. The EMA-smoothed GIoU loss curves show that GridSynthetic consistently achieves the lowest loss throughout training, alleviating the difficulty of localization and leading to more sufficient classification supervision and cleaner semantic alignment.
Figure 3: Visualizations of Zero-shot Inference on LVIS. We employ the pretrained OV-DEIM-L model and perform inference with the LVIS vocabulary containing 1,203 categories, using only a confidence threshold of 0.5.
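
The Figure 1 caption says that textual embeddings guide query selection and that the top-ranked queries are passed to the decoder. The sketch below shows one plausible reading of that step and is not taken from the paper: each visual token is scored by its maximum cosine similarity to any text embedding, and the indices of the `k` highest-scoring tokens are kept. The scoring rule and the `select_top_queries` name are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def select_top_queries(visual_feats: Sequence[Sequence[float]],
                       text_embeds: Sequence[Sequence[float]],
                       k: int) -> List[int]:
    """Return indices of the k visual tokens best matched by any text prompt.

    Assumption: a token's score is its maximum cosine similarity over all
    text embeddings; the paper's actual scoring rule may differ.
    """
    text_unit = [_normalize(t) for t in text_embeds]
    scores = []
    for feat in visual_feats:
        f = _normalize(feat)
        scores.append(max(sum(a * b for a, b in zip(f, t)) for t in text_unit))
    # Sort token indices by descending score and keep the first k.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

In a real detector this would run on batched tensors, and the selected tokens would carry both a feature embedding and a location, as the caption notes; the list-based version here only shows the ranking logic.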
