Auto-Vocabulary 3D Object Detection

Haomeng Zhang, Kuan-Chuan Peng, Suhas Lohit, Raymond A. Yeh

TL;DR

This work introduces Auto-Vocabulary 3D Object Detection (AV3DOD), a framework that autonomously discovers and expands semantic vocabularies for 3D object detection by leveraging 2D vision-language models (VLMs), pseudo 3D box proposals, and feature-space semantics expansion (FSSE). It couples a class-agnostic localization module with a semantic exploration module and a cross-modal alignment objective, and it is evaluated with a new Semantic Score (SS) that measures the quality of the generated class names. The approach achieves state-of-the-art results on ScanNetV2 and SUNRGB-D, outperforming prior open-vocabulary methods, with notable gains from each component (VLM-derived captions, pseudo boxes, and FSSE). This points toward open-world 3D understanding in which class names are generated automatically rather than specified by the user.

Abstract

Open-vocabulary 3D object detection methods are able to localize 3D boxes of classes unseen during training. Despite the name, existing methods rely on user-specified classes at both training and inference. We propose to study Auto-Vocabulary 3D Object Detection (AV3DOD), where the classes are automatically generated for the detected objects without any user input. To this end, we introduce Semantic Score (SS) to evaluate the quality of the generated class names. We then develop a novel framework, AV3DOD, which leverages 2D vision-language models (VLMs) to generate rich semantic candidates through image captioning, pseudo 3D box generation, and feature-space semantics expansion. AV3DOD achieves state-of-the-art (SOTA) performance on both localization (mAP) and semantic quality (SS) on the ScanNetV2 and SUNRGB-D datasets. Notably, it surpasses the previous SOTA, CoDA, by 3.48 overall mAP and attains a 24.5% relative improvement in SS on ScanNetV2.

Paper Structure

This paper contains 22 sections, 20 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: We introduce the task of Auto-Vocabulary 3D Object Detection (AV3DOD). (a) Existing Open-Vocabulary 3D Object Detection (OV3DOD) relies on a user-defined vocabulary ${\mathcal{V}}^{\text{pre}}$ and the input point cloud ${\mathcal{P}}$ to predict 3D objects ${\mathcal{B}}$, including bounding boxes and corresponding class labels. (b) In contrast, AV3DOD requires the model to autonomously derive its own vocabulary and run inference directly on ${\mathcal{P}}$ without any predefined ${\mathcal{V}}^{\text{pre}}$.
  • Figure 2: Illustration of AV3DOD (Sec. \ref{sec:overview}). AV3DOD consists of three key components: (1) Object Localization, which uses $F_{\text{det}}$ to generate class-agnostic 3D object proposals $\{{\bm{l}}_i, {\bm{f}}_i^{3D}\}_{i=1}^N$ from ${\mathcal{P}}$; (2) Novel Semantics Exploration (Sec. \ref{sec:semantic_exploration}), which constructs ${\mathcal{F}}^{\text{S}}$ by integrating base class features ${\mathcal{F}}^{\text{B}}$, VLM caption features ${\mathcal{F}}^{\text{C}}$, pseudo label features ${\mathcal{F}}^{\text{P}}$, and expanded features ${\mathcal{F}}^{\text{E}}$; and (3) Semantic Alignment, which aligns the detected 3D object features with ${\mathcal{F}}^{\text{S}}$ to predict class labels $\{c_i\}_{i=1}^N$. Input ${\mathcal{P}}$ is colorized for visualization.
  • Figure 3: Illustration of pseudo 3D box generation. A 2D VLM produces an object segmentation mask $M_k^P$ and a textual label $c_k^P$ for each detected object. The point cloud ${\mathcal{P}}$ is projected onto the image $I$ to extract the corresponding point group ${\mathcal{G}}^P_k$, from which a pseudo 3D bounding box is estimated via PCA. Input ${\mathcal{P}}$ is colorized for visualization.
  • Figure 4: Impact of feature space semantics expansion. We compare the distribution of pairwise cosine similarity between vocabulary features before and after expansion. The expanded feature space is generated by sampling 30% extra features, with similarity thresholds $\theta^{E}_{\text{min}} = 0.6$ and $\theta^{E}_{\text{max}} = 0.9$. The highlighted region on the right demonstrates the newly introduced diverse feature pairs, indicating broader semantics after expansion.
  • Figure 5: Qualitative results on the ScanNet validation set. 3D bounding boxes are projected onto RGB images for visualization. Objects from base and novel classes are shown in teal and lavender, respectively.
  • ...and 1 more figure
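
The pseudo 3D box step described in the Figure 3 caption (project the point cloud onto the image, collect the points that fall inside a 2D mask, then fit a box with PCA) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `select_points_in_mask` and `pca_oriented_box` are hypothetical, the points are assumed to be already expressed in the camera frame, the camera is assumed to be a pinhole model with intrinsics `K`, and the box is fitted with unconstrained 3D PCA, whereas an indoor pipeline might restrict the rotation to the gravity axis.

```python
import numpy as np


def select_points_in_mask(points_cam, K, mask):
    """Return the points whose pinhole projection lands inside a 2D mask.

    Assumptions (not from the source): points_cam is (N, 3) in the camera
    frame with +z pointing forward, K is a 3x3 intrinsics matrix, and mask
    is an (H, W) array that is truthy on the object.
    """
    points_cam = np.asarray(points_cam, dtype=float)
    mask = np.asarray(mask).astype(bool)
    z = points_cam[:, 2]
    in_front = z > 1e-6                      # drop points behind the camera
    uvw = points_cam @ np.asarray(K, dtype=float).T
    z_safe = np.where(in_front, z, 1.0)      # avoid dividing by zero
    u = np.round(uvw[:, 0] / z_safe).astype(int)
    v = np.round(uvw[:, 1] / z_safe).astype(int)
    h, w = mask.shape
    inside = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    keep = np.zeros(len(points_cam), dtype=bool)
    keep[inside] = mask[v[inside], u[inside]]
    return points_cam[keep]


def pca_oriented_box(points):
    """Fit an oriented 3D box to an (N, 3) point group using PCA.

    Returns (center, size, axes): axes is a 3x3 rotation whose columns are
    the box directions sorted by decreasing variance, and size holds the
    extent of the points along each of those directions.
    """
    points = np.asarray(points, dtype=float)
    mean = points.mean(axis=0)
    centered = points - mean
    cov = np.cov(centered, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    axes = eigvecs[:, ::-1].copy()           # largest-variance axis first
    if np.linalg.det(axes) < 0:              # keep a right-handed frame
        axes[:, 2] *= -1.0
    local = centered @ axes                  # coordinates in the PCA frame
    lo, hi = local.min(axis=0), local.max(axis=0)
    size = hi - lo
    center = mean + axes @ ((lo + hi) / 2.0)
    return center, size, axes
```

In use, one would call `select_points_in_mask` once per mask $M_k^P$ to obtain the point group ${\mathcal{G}}^P_k$ and then pass that group to `pca_oriented_box`. The box center is taken as the midpoint of the extents rather than the point mean, so uneven point density does not shift the box.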