A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space

Yonghao He, Hu Su, Haiyong Yu, Cong Yang, Wei Sui, Cong Wang, Song Liu

TL;DR

Open-set object detection is critical for robots operating in unstructured environments, but existing solutions are computationally heavy and hard to deploy. DOSOD is a lightweight, decoupled approach: an MLP adaptor projects vision-language text embeddings into a joint space, where they are aligned with detector features without expensive cross-modality interactions. It achieves competitive accuracy on LVIS and COCO while running substantially faster in real time and remaining feasible to deploy at the edge, as demonstrated on an RTX 4090 and on edge kits. The method is simple to deploy, reuses a YOLO-based detector, and is supported by a public code release for practical OSOD in robotics.

Abstract

Open-set object detection (OSOD) is highly desirable for robotic manipulation in unstructured environments. However, existing OSOD methods often fail to meet the requirements of robotic applications due to their high computational burden and complex deployment. To address this issue, this paper proposes a light-weight framework called Decoupled OSOD (DOSOD), a practical and highly efficient solution for real-time OSOD tasks in robotic systems. Specifically, DOSOD builds upon the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector. A Multilayer Perceptron (MLP) adaptor transforms the text embeddings extracted by the VLM into a joint space, within which the detector learns the region representations of class-agnostic proposals. Cross-modality features are aligned directly in the joint space, which avoids complex feature interactions and thereby improves computational efficiency. During the testing phase, DOSOD operates like a traditional closed-set detector, effectively bridging the gap between closed-set and open-set detection. Compared to the baseline YOLO-World, the proposed DOSOD significantly enhances real-time performance while maintaining comparable accuracy. The small DOSOD-S model achieves a Fixed AP of $26.7\%$, compared to $26.2\%$ for YOLO-World-v1-S and $22.7\%$ for YOLO-World-v2-S, using similar backbones on the LVIS minival dataset. The FPS of DOSOD-S is $57.1\%$ higher than that of YOLO-World-v1-S and $29.6\%$ higher than that of YOLO-World-v2-S. We also demonstrate that the DOSOD model facilitates deployment on edge devices. The code and models are publicly available at https://github.com/D-Robotics-AI-Lab/DOSOD.
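
The decoupled alignment described in the abstract can be illustrated with a minimal NumPy sketch. It is not the released implementation: the two-layer adaptor, the ReLU activation, the cosine-similarity scoring, all dimensions, and the random stand-ins for the VLM text embeddings and detector region features are assumptions made for illustration only.

```python
import numpy as np

def mlp_adaptor(text_emb, w1, b1, w2, b2):
    """Map text embeddings into the joint space.
    A two-layer MLP with ReLU is an assumption, not the paper's exact design."""
    hidden = np.maximum(text_emb @ w1 + b1, 0.0)
    return hidden @ w2 + b2

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps guards against zero rows)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def alignment_logits(region_feats, joint_text):
    """Score every region against every category directly in the joint space.
    No text-image interaction happens before this single matrix product."""
    return l2_normalize(region_feats) @ l2_normalize(joint_text).T

rng = np.random.default_rng(0)
num_classes, text_dim, hidden_dim, joint_dim, num_regions = 3, 8, 16, 4, 5

# Random stand-ins (hypothetical): VLM text-encoder output and detector region features.
text_emb = rng.normal(size=(num_classes, text_dim))
region_feats = rng.normal(size=(num_regions, joint_dim))

# Randomly initialised adaptor parameters (hypothetical shapes).
w1, b1 = rng.normal(size=(text_dim, hidden_dim)), np.zeros(hidden_dim)
w2, b2 = rng.normal(size=(hidden_dim, joint_dim)), np.zeros(joint_dim)

joint_text = mlp_adaptor(text_emb, w1, b1, w2, b2)
logits = alignment_logits(region_feats, joint_text)
print(logits.shape)  # one score per (region, category) pair
```

The point of the sketch is that the text branch and the image branch only meet in the final product, so the text side can be computed independently of any image.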

Paper Structure

This paper contains 17 sections, 4 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Feature alignment strategy. (a) illustrates the teacher-student distillation approach, where image features extracted by the VLM and the detector are aligned under the supervision of text embeddings generated by the VLM's text encoder. Alternatively, proposals can be used to crop the image, with region features then aligned in a similar manner. (b) shows the interaction-based alignment strategy, where text embeddings interact with image features extracted by the detector's backbone to achieve alignment. (c) presents the proposed decoupled alignment strategy, which aligns features without any interaction.
  • Figure 2: The difference in the last layer of the classification branch between closed-set and open-set detection.
  • Figure 3: Overview of our DOSOD framework. A detector learns class-agnostic proposals, and the category text embeddings for these proposals are computed using the VLM's text encoder. The embeddings are transformed by the MLP-based adaptor and then aligned with the region features extracted by the detector. The transformed text embeddings serve as the classifier. During inference, text embeddings of novel categories are used to enable zero-shot detection.
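
The idea in Figures 2 and 3, that the transformed text embeddings act as the classifier and that novel categories only change that classifier, can be sketched as follows. Everything here is a toy assumption: `TOY_TEXT_ENCODER`, `adapt`, and the 3-dimensional vectors are hypothetical placeholders for the VLM text encoder and the MLP adaptor, not names from the DOSOD code base.

```python
import numpy as np

# Hypothetical stand-in for the VLM text encoder: fixed toy embeddings per category name.
TOY_TEXT_ENCODER = {
    "cup": np.array([1.0, 0.0, 0.0]),
    "bottle": np.array([0.0, 1.0, 0.0]),
    "mug": np.array([0.0, 0.0, 1.0]),
}

def adapt(text_emb):
    """Hypothetical stand-in for the trained adaptor (here a fixed linear map)."""
    return text_emb @ (2.0 * np.eye(3))

def build_classifier(vocabulary):
    """Run the text side once, offline. The result is a plain weight matrix,
    playing the role of the last classification layer of a closed-set detector."""
    return np.stack([adapt(TOY_TEXT_ENCODER[name]) for name in vocabulary])

def classify_regions(region_feats, classifier_weights, vocabulary):
    """Closed-set style inference: one matrix product, then an argmax per region."""
    scores = region_feats @ classifier_weights.T
    return [vocabulary[i] for i in scores.argmax(axis=1)]

# A single toy region feature in the joint space (stand-in for detector output).
region_feats = np.array([[0.1, 0.2, 0.9]])

base_vocab = ["cup", "bottle"]
novel_vocab = base_vocab + ["mug"]  # add a category without touching the detector

print(classify_regions(region_feats, build_classifier(base_vocab), base_vocab))
print(classify_regions(region_feats, build_classifier(novel_vocab), novel_vocab))
```

In this toy setup, extending the vocabulary only adds a row to the weight matrix; the region features and the scoring step are unchanged, which is what lets the model run like a conventional closed-set detector at test time.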