Open-Det: An Efficient Learning Framework for Open-Ended Detection

Guiping Cao, Tao Wang, Wenjian Huang, Xiangyuan Lan, Jianguo Zhang, Dongmei Jiang

TL;DR

Open-Det tackles open-ended detection by decoupling and accelerating both bounding-box learning and object-name generation. It introduces BVLA-M for bidirectional vision-language alignment, VLD-M to distill VLM knowledge into VL-prompts, a LoRA-based Object Name Generator with Text Denoising, and a Masked Alignment Loss plus a Joint Loss to stabilize cross-modal supervision. Compared to GenerateU and GLIP variants, the framework is more data-efficient and converges faster, while using far fewer GPUs and training epochs. This work advances practical vocabulary-free open-world detection with strong cross-modal alignment and efficient training dynamics.

Abstract

Open-Ended object Detection (OED) is a novel and challenging task that detects objects and generates their category names in a free-form manner, without requiring additional vocabularies during inference. However, existing OED models, such as GenerateU, require large-scale datasets for training, suffer from slow convergence, and exhibit limited performance. To address these issues, we present a novel and efficient Open-Det framework, consisting of four collaborative parts. Specifically, Open-Det accelerates model training in both the bounding box and object name generation processes by reconstructing the Object Detector and the Object Name Generator. To bridge the semantic gap between the Vision and Language modalities, we propose a Vision-Language Aligner with V-to-L and L-to-V alignment mechanisms, combined with the Prompts Distiller to transfer knowledge from the VLM into VL-prompts, enabling accurate object name generation by the LLM. In addition, we design a Masked Alignment Loss to eliminate contradictory supervision and introduce a Joint Loss to enhance classification, resulting in more efficient training. Compared to GenerateU, Open-Det achieves even higher performance (+1.0% in APr) while using only 1.5% of the training data (0.077M vs. 5.077M), 20.8% of the training epochs (31 vs. 149), and fewer GPU resources (4 V100 vs. 16 A100). The source code is available at: https://github.com/Med-Process/Open-Det.
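The efficiency percentages quoted in the abstract follow directly from the raw numbers it gives. A minimal sketch of that arithmetic is shown here; the dictionary keys and the `percent` helper are illustrative names, not taken from the Open-Det code base, and only the numeric values come from the abstract.

```python
# Values quoted in the abstract: training data (millions) and training epochs.
# The dictionary layout and key names are assumptions made for illustration.
generateu = {"data_m": 5.077, "epochs": 149}
open_det = {"data_m": 0.077, "epochs": 31}


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


data_pct = percent(open_det["data_m"], generateu["data_m"])
epoch_pct = percent(open_det["epochs"], generateu["epochs"])

print(f"training data used: {data_pct}% of GenerateU")
print(f"training epochs used: {epoch_pct}% of GenerateU")
```

The GPU figures (4 V100 vs. 16 A100) are left out of the calculation because the two setups use different hardware, so a simple count ratio would not be a like-for-like comparison.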

Paper Structure

This paper contains 39 sections, 9 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Performance curves of GenerateU and Open-Det, trained on VG (Visual Genome) and evaluated zero-shot on LVIS MiniVal.
  • Figure 2: Main architecture of the Open-Det framework. It consists of 4 collaborative components: (1) Object Detector (ODR) for accelerating the bounding box training; (2) Prompts Distiller with the Vision-to-Language Distillation module (VLD-M) to bridge the semantic gap between Vision and Language; (3) Object Name Generator with the Text Denoising approach to accelerate the training of the LoRA Head; (4) Vision-Language Aligner with BVLA-M to enhance the alignment of Vision and Language. The Masked Alignment Loss and Joint Loss are introduced for correcting the supervision information and enhancing binary classification consistency, respectively. Please refer to Sec. \ref{sec:app_pipeline} for the simplified pipeline.
  • Figure 3: Overall architecture of the proposed VLD-M.
  • Figure 4: Visualization results for Ground Truth, GenerateU, and Open-Det on the LVIS MiniVal dataset. Open-Det demonstrates superior capability in detecting a broader range of potential objects in images (indicated by yellow arrows), covering large-scale objects, small objects, and fine-grained details, such as cabinet, rug, light, radiator, and fireplace in (a); sidewalk, shadows, street, and small sign in (b); sky, sky lift, head, and pole in (c); and wall, ear, eye, and pillow in (d).
  • Figure 5: The simplified pipeline of the Open-Det framework. The Vision-Language Model (VLM) and input texts are used only in the training phase. Two icons in the figure indicate whether a module's weights are activated or frozen.
  • ...and 9 more figures