OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection

Adrian Chow, Evelien Riddell, Yimu Wang, Sean Sedwards, Krzysztof Czarnecki

TL;DR

OV-SCAN tackles open-vocabulary 3D object detection for autonomous driving by addressing semantic misalignment between 3D boxes and text-aligned 2D embeddings. It introduces SC-NOD, which generates reliable 3D annotations and selects only semantically consistent cross-modal pairs for alignment, together with the Hierarchical Two-Stage Alignment (H2SA) head, which progressively aligns 3D features with 2D embeddings across scales. The method combines an adaptive PSO-based 3D box search, one-to-many cross-modal alignment (CMA), and a prompt-based classification pipeline to recognize fine-grained novel categories without human 3D annotations, achieving state-of-the-art results on nuScenes and KITTI. The approach advances practical open-set perception for autonomous systems by improving cross-modal fidelity and making novel-object discovery more robust under occlusion and low resolution.

Abstract

Open-vocabulary 3D object detection for autonomous driving aims to detect novel objects beyond the predefined training label sets in point cloud scenes. Existing approaches achieve this by connecting traditional 3D object detectors with vision-language models (VLMs) to regress 3D bounding boxes for novel objects and perform open-vocabulary classification through cross-modal alignment between 3D and 2D features. However, achieving robust cross-modal alignment remains a challenge due to semantic inconsistencies when generating corresponding 3D and 2D feature pairs. To overcome this challenge, we present OV-SCAN, an Open-Vocabulary 3D framework that enforces Semantically Consistent Alignment for Novel object discovery. OV-SCAN employs two core strategies: discovering precise 3D annotations and filtering out low-quality or corrupted alignment pairs (arising from 3D annotation, occlusion-induced, or resolution-induced noise). Extensive experiments on the nuScenes dataset demonstrate that OV-SCAN achieves state-of-the-art performance.
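The second strategy in the abstract, filtering out corrupted alignment pairs while still using every 3D annotation, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Proposal` fields, the use of cosine similarity between a 2D image embedding and a class text embedding as the consistency score, and the `threshold` value are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Proposal:
    """Hypothetical cross-modal proposal: a fitted 3D box plus 2D/text embeddings."""
    box_3d: Tuple[float, ...]          # e.g. (x, y, z, l, w, h, yaw); layout is assumed
    image_embedding: Sequence[float]   # 2D feature of the image crop (e.g. from a VLM)
    text_embedding: Sequence[float]    # text feature of the assigned class prompt


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_targets(
    proposals: List[Proposal], threshold: float = 0.5
) -> Tuple[List[Tuple[float, ...]], List[Proposal]]:
    """Keep every box for box supervision, but only consistent pairs for alignment.

    The consistency test (image/text cosine similarity >= threshold) is an
    assumed stand-in for whatever criterion the method actually applies.
    """
    box_targets = [p.box_3d for p in proposals]
    alignment_targets = [
        p for p in proposals
        if cosine_similarity(p.image_embedding, p.text_embedding) >= threshold
    ]
    return box_targets, alignment_targets
```

With one clean and one occluded proposal, for example, both boxes would remain in `box_targets`, but only the clean one would contribute to the alignment loss.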

Paper Structure

This paper contains 27 sections, 13 equations, 10 figures, 9 tables, and 2 algorithms.

Figures (10)

  • Figure 1: Cross-modal Alignment Performance. The red CDF shows the distribution of the distance between the 3D embedding produced by a baseline OV-3D detector and the corresponding 2D embedding from CLIP on the nuScenes validation set. The green CDF shows this distribution using OV-SCAN instead of the baseline. The latter is shifted to the left, showing an improved alignment between the 3D and 2D embeddings due to supervision by our higher-quality 3D-2D proposal pairings and our alignment head.
  • Figure 2: 3D Annotation Errors. Common 3D annotation errors during box parametrization include, but are not limited to, poor L-shape fitting, misinterpreted surfaces, and misaligned surfaces.
  • Figure 3: Sources of Semantic Discrepancies. (a) CLIP similarity scores for a truck reveal that occlusion results in an ambiguous 2D image feature. (b) CLIP similarity scores for a distant pedestrian demonstrate that insufficient resolution leads to a degraded 2D image feature.
  • Figure 4: Overall Framework for OV-SCAN. During novel object discovery, SC-NOD associates novel object proposals with corresponding object clusters, creating cross-modal proposals. SC-NOD performs an adaptive search to fit 3D annotations and extracts 2D image features to prepare cross-modal targets for supervision. SC-NOD identifies which samples are fit for alignment based on cross-modal semantic consistency. During training, all 3D annotations are used, while only consistent novel objects guide cross-modal alignment.
  • Figure 5: Illustration of the Hierarchical Two-Stage Alignment (H2SA) Head. H2SA first predicts the high-level novel classes, then derives class-based text prompts. H2SA then uses text prototypes to incrementally map 3D features to their 2D counterparts.
  • ...and 5 more figures
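
The comparison described in the Figure 1 caption, a CDF of distances between paired 3D and 2D embeddings, can be reproduced in outline as follows. This is a minimal sketch under stated assumptions: the caption does not say which distance is used, so cosine distance is assumed here, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; assumed metric, the caption only says 'distance'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must have non-zero norm")
    return 1.0 - dot / (norm_a * norm_b)


def empirical_cdf(distances: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (distance, cumulative fraction) points of the empirical CDF."""
    ordered = sorted(distances)
    n = len(ordered)
    return [(d, (i + 1) / n) for i, d in enumerate(ordered)]


def fraction_within(distances: Sequence[float], radius: float) -> float:
    """CDF value at `radius`: share of pairs whose distance is <= radius."""
    return sum(1 for d in distances if d <= radius) / len(distances)
```

A CDF that is "shifted to the left" means that, for any given radius, `fraction_within` is at least as large for the better-aligned model as for the baseline.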