Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Pengkun Jiao, Na Zhao, Jingjing Chen, Yu-Gang Jiang

TL;DR

This work tackles open-vocabulary 3D object detection under limited 3D training data. It introduces INHA, combining image-guided novel class discovery (IGND) with hierarchical cross-modal alignment to map 3D features to vision-language space at instance, category, and scene levels, aided by a PISE module and CLIP-based encodings. Through a three-stage training regime on SUN RGB-D and ScanNet datasets, INHA achieves state-of-the-art results for novel classes and improves base-class performance, validating the benefit of leveraging vision-language foundation models for 3D open-vocabulary learning. The findings demonstrate that comprehensive guidance from text and images can substantially enhance 3D object recognition in real-world scenarios.

Abstract

Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene. While language and vision foundation models have succeeded on various open-vocabulary tasks given abundant training data, OV-3DDet is hampered by the limited availability of 3D training data. Although some pioneering efforts have integrated vision-language model (VLM) knowledge into OV-3DDet learning, the potential of these foundation models has yet to be fully exploited. In this paper, we unlock textual and visual wisdom to tackle the open-vocabulary 3D detection task by leveraging language and vision foundation models. We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes. Specifically, we utilize an object detection vision foundation model to enable the zero-shot discovery of objects in images, which serve as the initial seeds and filtering guidance to identify novel 3D objects. Additionally, to align the 3D space with the powerful vision-language space, we introduce a hierarchical alignment approach, where the 3D feature space is aligned with the vision-language feature space using a pre-trained VLM at the instance, category, and scene levels. Through extensive experimentation, we demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection in real-world scenarios.

Paper Structure (16 sections, 5 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Illustration of the two key components of our proposed INHA: (a) Image-guided novel object discovery (IGND) employs a vision foundation model to extract 2D bounding boxes and uses them as priors to guide the discovery of novel 3D objects. (b) Hierarchical feature space alignment aligns the 3D feature space with the vision-language feature space at the instance, category, and scene levels.
  • Figure 2: Illustration of our proposed INHA framework. Our INHA framework primarily consists of three stages. Firstly, we train a base detector solely utilizing base objects. Subsequently, in the second stage, we enhance the 3D detector by incorporating discovered novel classes from the IGND module. Finally, in the third stage, we perform hierarchical alignment of the 3D feature space with the vision-language feature space at the instance, category, and scene levels.
  • Figure 3: Illustration of IGND. (a) The overall IGND framework combines 2D and 3D detectors for novel object discovery. (b) Lifting 2D object centers into 3D space enriches the pool of query seeds, facilitating the generation of novel object proposals. (c) 3D bounding boxes are projected into 2D and matched with the 2D detected objects by IoU score to select the most suitable candidates.
  • Figure 4: Architecture of PISE. The object features within a scene are concatenated and passed through two linear layers, with a ReLU layer between them, to generate a high-dimensional feature. Max pooling is then applied to this high-dimensional feature to obtain a permutation-invariant scene-level feature.
  • Figure 5: Mean average recall (mAR) and mean average precision (mAP) for novel classes over training epochs on SUN RGB-D.
  • ...and 2 more figures
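
The PISE caption (Figure 4) describes a concrete computation: two linear layers with a ReLU between them, followed by max pooling over the objects in a scene. The snippet below is a minimal pure-Python sketch of that idea, not the authors' implementation. It assumes the two linear layers are shared across objects and applied to each object feature independently, which is what makes the pooled result independent of object order. The function names (`linear`, `relu`, `init_linear`, `scene_feature`), the feature dimensions (4, 8, 16), and the random initialization are all hypothetical choices made for illustration.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector; weight has shape out_dim x in_dim."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in x]


def init_linear(in_dim, out_dim, rng):
    """Hypothetical initialization: small uniform weights, zero bias."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def scene_feature(object_features, layer1, layer2):
    """Linear -> ReLU -> Linear per object, then max pool over objects.

    Assumption: both layers are shared by every object in the scene.
    """
    if not object_features:
        raise ValueError("a scene needs at least one object feature")
    w1, b1 = layer1
    w2, b2 = layer2
    lifted = [linear(relu(linear(f, w1, b1)), w2, b2) for f in object_features]
    # Channel-wise max over objects; the max does not depend on object order.
    return [max(channel) for channel in zip(*lifted)]


rng = random.Random(0)
layer1 = init_linear(4, 8, rng)    # object feature dim 4 -> hidden dim 8
layer2 = init_linear(8, 16, rng)   # hidden dim 8 -> scene feature dim 16
objects = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]

scene = scene_feature(objects, layer1, layer2)
reordered = scene_feature(list(reversed(objects)), layer1, layer2)
print(len(scene), scene == reordered)
```

Reversing the object list leaves the scene feature unchanged because each object is transformed on its own and the channel-wise maximum ignores ordering. A sum or mean would also be order-independent; max pooling is the choice named in the caption.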