OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection
Adrian Chow, Evelien Riddell, Yimu Wang, Sean Sedwards, Krzysztof Czarnecki
TL;DR
OV-SCAN tackles open-vocabulary 3D object detection for autonomous driving by addressing semantic misalignment between 3D boxes and text-aligned 2D embeddings. It introduces SC-NOD to generate reliable 3D annotations and selective alignment to filter noisy cross-modal pairs, together with a Hierarchical Two-Stage Alignment (H2SA) head that progressively aligns 3D features with 2D embeddings across scales. The method uses an adaptive PSO-based 3D box search, one-to-many CMA, and a prompt-based classification pipeline to recognize fine-grained novel categories without human 3D annotations, achieving state-of-the-art results on nuScenes and KITTI. By improving cross-modal fidelity, the approach makes novel-object discovery more robust under occlusion and low resolution.
Abstract
Open-vocabulary 3D object detection for autonomous driving aims to detect novel objects beyond the predefined training label sets in point cloud scenes. Existing approaches achieve this by connecting traditional 3D object detectors with vision-language models (VLMs) to regress 3D bounding boxes for novel objects and perform open-vocabulary classification through cross-modal alignment between 3D and 2D features. However, robust cross-modal alignment remains a challenge because the generated 3D and 2D feature pairs are often semantically inconsistent. To overcome this challenge, we present OV-SCAN, an Open-Vocabulary 3D framework that enforces Semantically Consistent Alignment for Novel object discovery. OV-SCAN employs two core strategies: discovering precise 3D annotations and filtering out low-quality or corrupted alignment pairs (those affected by noise from 3D annotation errors, occlusion, or low resolution). Extensive experiments on the nuScenes dataset demonstrate that OV-SCAN achieves state-of-the-art performance.
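The second strategy, filtering alignment pairs before aligning 3D and 2D features, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the per-pair `pair_quality` score, the `min_quality` threshold, and the choice of cosine distance as the alignment loss are all assumptions made here for illustration.

```python
import math


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    # Guard against division by zero for all-zero vectors.
    return dot / max(norm_u * norm_v, eps)


def selective_alignment_loss(feats_3d, embeds_2d, pair_quality, min_quality=0.5):
    """Mean cosine distance over pairs that pass a quality threshold.

    feats_3d, embeds_2d: lists of vectors, paired by index (hypothetical inputs).
    pair_quality: one score per pair; how such a score is obtained is
                  outside this sketch (assumption).
    Returns (loss, indices of the pairs that were kept).
    """
    kept = [i for i, q in enumerate(pair_quality) if q >= min_quality]
    if not kept:
        # No trustworthy pairs: contribute nothing to the loss.
        return 0.0, kept
    total = sum(1.0 - cosine(feats_3d[i], embeds_2d[i]) for i in kept)
    return total / len(kept), kept


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    embeds = [[1.0, 0.0], [1.0, 0.0], [-1.0, -1.0]]
    quality = [0.9, 0.8, 0.1]  # third pair is treated as corrupted
    loss, kept = selective_alignment_loss(feats, embeds, quality)
    print(loss, kept)
```

In this toy input the third pair, whose features point in opposite directions, is excluded by its low score, so it does not pull the 3D features toward a corrupted 2D target.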
