Auto-Vocabulary 3D Object Detection
Haomeng Zhang, Kuan-Chuan Peng, Suhas Lohit, Raymond A. Yeh
TL;DR
This work introduces Auto-Vocabulary 3D Object Detection (AV3DOD), a framework that autonomously discovers and expands semantic vocabularies for 3D object detection by leveraging 2D vision-language models (VLMs), pseudo-3D box proposals, and feature-space semantics expansion (FSSE). It couples a class-agnostic localization module with a semantic exploration module and a cross-modal alignment objective, guided by a new Semantic Score that jointly assesses localization accuracy and semantic quality. The approach achieves state-of-the-art results on ScanNetV2 and SUNRGB-D, outperforming prior open-vocabulary methods, with each component (VLM-derived captions, pseudo boxes, and FSSE) contributing to the gains. This opens a path toward open-world 3D understanding with autonomous semantic reasoning and minimal human supervision at training and inference.
Abstract
Open-vocabulary 3D object detection methods are able to localize 3D boxes of classes unseen during training. Despite the name, existing methods rely on user-specified classes both at training and inference. We propose to study Auto-Vocabulary 3D Object Detection (AV3DOD), where the classes are automatically generated for the detected objects without any user input. To this end, we introduce Semantic Score (SS) to evaluate the quality of the generated class names. We then develop a novel framework, AV3DOD, which leverages 2D vision-language models (VLMs) to generate rich semantic candidates through image captioning, pseudo 3D box generation, and feature-space semantics expansion. AV3DOD achieves state-of-the-art (SOTA) performance on both localization (mAP) and semantic quality (SS) on the ScanNetV2 and SUNRGB-D datasets. Notably, it surpasses the previous SOTA method, CoDA, by 3.48 in overall mAP and attains a 24.5% relative improvement in SS on ScanNetV2.
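The abstract reports its two headline gains in different units: the mAP gain as an absolute difference and the SS gain as a relative percentage. The sketch below shows how each kind of figure is usually computed. The function names and the input values are hypothetical placeholders chosen for easy arithmetic; they are not scores from the paper, and the abstract does not give the underlying SS values.

```python
def absolute_gain(new: float, baseline: float) -> float:
    """Difference in raw metric points (how the mAP gain is reported)."""
    return new - baseline


def relative_gain_percent(new: float, baseline: float) -> float:
    """Improvement as a percentage of the baseline (how the SS gain is reported)."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (new - baseline) / baseline * 100.0


# Placeholder values for illustration only, NOT results from the paper.
baseline_score = 40.0
new_score = 50.0

print(absolute_gain(new_score, baseline_score))          # 10.0
print(relative_gain_percent(new_score, baseline_score))  # 25.0
```

The same pair of scores gives a 10-point absolute gain and a 25% relative gain, so the two headline numbers in the abstract are not directly comparable with each other.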
