OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer

Yu Wang, Xiangbo Su, Qiang Chen, Xinyu Zhang, Teng Xi, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang

TL;DR

The paper addresses open-vocabulary object detection under real-time constraints by proposing OVLW-DETR, a DETR-based detector that is aligned with a vision-language model's (VLM's) text encoder through a simple, fusion-free scheme: class-name embeddings from the text encoder are used in place of fixed classifier weights. The detector preserves the light-weight LW-DETR architecture, while the text encoder supplies the open-vocabulary semantics. Training keeps the text encoder frozen and optimizes with IoU-aware and IA-BCE losses, along with parallel weight-sharing decoders (Group DETR), yielding stable, efficient learning. Empirically, OVLW-DETR achieves strong zero-shot LVIS performance while maintaining low end-to-end latency on a T4 GPU at FP16 precision, outperforming recent real-time open-vocabulary baselines. The work provides a practical path to deploying open-vocabulary detectors in real-world scenarios without additional fusion modules and with straightforward training.

Abstract

Open-vocabulary object detection focuses on detecting novel categories guided by natural language. In this report, we propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment-friendly open-vocabulary detector with strong performance and low latency. Building upon OVLW-DETR, we provide an end-to-end training recipe that transfers knowledge from a vision-language model (VLM) to the object detector with simple alignment. We align the detector with the text encoder from the VLM by replacing the fixed classification layer weights in the detector with the class-name embeddings extracted from the text encoder. Without an additional fusing module, OVLW-DETR is flexible and deployment-friendly, making it easier to implement and modulate and improving the efficiency of interleaved attention computation. Experimental results demonstrate that the proposed approach is superior to existing real-time open-vocabulary detectors on the standard Zero-Shot LVIS benchmark. Source code and pre-trained models are available at [https://github.com/Atten4Vis/LW-DETR].
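
The alignment described in the abstract, where class-name embeddings take the place of fixed classification weights, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the linear projection `proj`, the L2 normalization, and the random arrays standing in for the frozen text encoder's output are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length (assumed here; not stated in the abstract)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def open_vocab_logits(query_feats, text_embeds, proj):
    """Score every decoder query against every class-name embedding.

    query_feats: (num_queries, d_model) decoder outputs
    text_embeds: (num_classes, d_text) class-name embeddings (frozen)
    proj:        (d_model, d_text) hypothetical projection matching dimensions
    """
    q = l2_normalize(query_feats @ proj)
    t = l2_normalize(text_embeds)
    # The text embeddings act as the classifier weight matrix.
    return q @ t.T

rng = np.random.default_rng(0)
num_queries, d_model, d_text = 5, 16, 8

# Random stand-ins for a frozen text encoder's output on three class names.
class_names = ["cat", "dog", "traffic cone"]
text_embeds = rng.normal(size=(len(class_names), d_text))

query_feats = rng.normal(size=(num_queries, d_model))
proj = rng.normal(size=(d_model, d_text))

logits = open_vocab_logits(query_feats, text_embeds, proj)

# Extending the vocabulary only appends an embedding row; no detector
# weights change and the scores for existing classes stay the same.
new_embed = rng.normal(size=(1, d_text))
extended_logits = open_vocab_logits(
    query_feats, np.concatenate([text_embeds, new_embed]), proj
)
```

Because the class set enters only through `text_embeds`, changing the vocabulary at inference time amounts to encoding a different list of names, which is consistent with the abstract's point that no additional fusing module is needed.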

Paper Structure

This paper contains 9 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: OVLW-DETR framework