Open-Det: An Efficient Learning Framework for Open-Ended Detection

Guiping Cao; Tao Wang; Wenjian Huang; Xiangyuan Lan; Jianguo Zhang; Dongmei Jiang

TL;DR

Open-Det tackles open-ended detection by decoupling and accelerating both bounding-box learning and object-name generation. It introduces BVLA-M for bidirectional vision-language alignment, VLD-M to distill VLM knowledge into VL-prompts, a LoRA-based Object Name Generator with Text Denoising, and a Masked Alignment Loss plus a Joint Loss to stabilize cross-modal supervision. The framework is more data-efficient and converges faster than GenerateU and GLIP variants, while using far fewer GPUs and training epochs. This work advances practical vocabulary-free open-world detection with strong cross-modal alignment and efficient training dynamics.

Abstract

Open-Ended object Detection (OED) is a novel and challenging task that detects objects and generates their category names in a free-form manner, without requiring additional vocabularies during inference. However, existing OED models, such as GenerateU, require large-scale datasets for training, suffer from slow convergence, and exhibit limited performance. To address these issues, we present a novel and efficient Open-Det framework, consisting of four collaborative parts. Specifically, Open-Det accelerates model training in both the bounding box and object name generation processes by reconstructing the Object Detector and the Object Name Generator. To bridge the semantic gap between the Vision and Language modalities, we propose a Vision-Language Aligner with V-to-L and L-to-V alignment mechanisms, combined with a Prompts Distiller that transfers knowledge from the VLM into VL-prompts, enabling the LLM to generate accurate object names. In addition, we design a Masked Alignment Loss to eliminate contradictory supervision and introduce a Joint Loss to enhance classification, resulting in more efficient training. Compared to GenerateU, Open-Det achieves even higher performance (+1.0% in APr) while using only 1.5% of the training data (0.077M vs. 5.077M), 20.8% of the training epochs (31 vs. 149), and fewer GPU resources (4 V100 vs. 16 A100). The source code is available at: https://github.com/Med-Process/Open-Det.
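The efficiency percentages quoted in the abstract follow directly from the raw figures it gives. The short Python check below recomputes them; the helper name `percent` and the variable names are illustrative only and do not come from the Open-Det code base.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

# Figures taken from the abstract: Open-Det vs. GenerateU.
data_pct = percent(0.077, 5.077)   # training data, in millions of samples
epoch_pct = percent(31, 149)       # training epochs

print(f"training data:   {data_pct:.1f}% of GenerateU")
print(f"training epochs: {epoch_pct:.1f}% of GenerateU")
```

Rounded to one decimal place, the two ratios match the 1.5% and 20.8% reported in the abstract.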
