Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments

Djamahl Etchegaray, Zi Huang, Tatsuya Harada, Yadan Luo

TL;DR

This work tackles open-vocabulary open-set 3D object detection in urban LiDAR scenes by leveraging vision–language models through four baseline strategies and introducing a universal Find n' Propagate framework. The approach combines a Greedy Box Seeker that exhaustively searches 3D proposals from 2D region predictions, a Greedy Box Oracle that enforces proposal quality via multi-view alignment and density-based scoring, and a Remote Propagator with geometry and density simulators to diversify pseudo-labels for distant objects, all integrated into a memory-bank self-training loop. Across nuScenes and KITTI with several OV protocols, the method achieves substantial gains in novel recall (up to 53%) and novel-object AP improvements (up to 3.97×), while bottom-up variants reach notable recall and AP increases (e.g., AR_N up by 21% and AP_N by ~3.9×). The results demonstrate that explicit near-to-far propagation and bias-robust proposal generation can significantly enhance open-vocabulary 3D detection in complex urban environments, with practical implications for safe autonomous operation and scalable concept expansion.

Abstract

In this work, we tackle the limitations of current LiDAR-based 3D object detection systems, which are hindered by a restricted class vocabulary and the high cost of annotating new object classes. Our exploration of open-vocabulary (OV) learning in urban environments aims to capture novel instances using pre-trained vision-language models (VLMs) with multi-sensor data. We design and benchmark a set of four potential solutions as baselines, categorizing them as either top-down or bottom-up approaches based on their input data strategies. While effective, these methods exhibit certain limitations, such as missing novel objects in 3D box estimation or applying rigorous priors, leading to biases towards objects near the camera or of rectangular geometries. To overcome these limitations, we introduce a universal Find n' Propagate approach for 3D OV tasks, aimed at maximizing the recall of novel objects and propagating this detection capability to more distant areas, thereby progressively capturing more novel instances. In particular, we utilize a greedy box seeker to search for novel 3D boxes of varying orientations and depths in each generated frustum, and ensure the reliability of newly identified boxes through cross alignment and a density ranker. Additionally, the inherent bias towards camera-proximal objects is alleviated by the proposed remote simulator, which randomly diversifies pseudo-labeled novel instances in the self-training process, combined with the fusion of base samples in the memory bank. Extensive experiments demonstrate a 53% improvement in novel recall across diverse OV settings, VLMs, and 3D detectors. Notably, we achieve up to a 3.97-fold increase in Average Precision (AP) for novel object classes. The source code is available at https://github.com/djamahl99/findnpropagate.
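
The idea of greedily searching box depths and orientations inside a frustum can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`points_in_box`, `greedy_box_search`), the fixed box size, the single central ray standing in for a full frustum, and the use of a raw point count as the only score are all assumptions made for illustration. The paper's cross alignment and density ranker are not reproduced here.

```python
import math


def points_in_box(points, center, size, yaw):
    """Count (x, y, z) points inside a box rotated by `yaw` about the z axis."""
    cx, cy, cz = center
    length, width, height = size
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    count = 0
    for x, y, z in points:
        dx, dy, dz = x - cx, y - cy, z - cz
        # Rotate the offset into the box's local frame.
        local_x = cos_y * dx + sin_y * dy
        local_y = -sin_y * dx + cos_y * dy
        if (abs(local_x) <= length / 2
                and abs(local_y) <= width / 2
                and abs(dz) <= height / 2):
            count += 1
    return count


def greedy_box_search(points, ray_dir, size, depths, yaws, z_center=0.0):
    """Try every (depth, yaw) pair along a ray; keep the box holding most points.

    Hypothetical stand-in for a frustum search: `ray_dir` is the ground-plane
    direction of the frustum's central ray, starting at the sensor origin.
    Returns (point_count, center, yaw) for the best candidate.
    """
    norm = math.hypot(ray_dir[0], ray_dir[1])
    ux, uy = ray_dir[0] / norm, ray_dir[1] / norm
    best = (-1, None, None)
    for depth in depths:
        center = (ux * depth, uy * depth, z_center)
        for yaw in yaws:
            count = points_in_box(points, center, size, yaw)
            if count > best[0]:  # ties keep the earlier candidate
                best = (count, center, yaw)
    return best


# Seven points spread along the x axis around x = 10.
points = [(10.0 + 0.5 * i, 0.0, 0.0) for i in range(-3, 4)]
best = greedy_box_search(
    points,
    ray_dir=(1.0, 0.0),
    size=(4.0, 2.0, 2.0),
    depths=[5.0, 10.0, 15.0],
    yaws=[0.0, math.pi / 2],
)
print(best)  # (7, (10.0, 0.0, 0.0), 0.0)
```

In this toy case the box at depth 10 with zero yaw covers all seven points, while the same box rotated by 90 degrees covers only five, so the search prefers the first. A point count alone favours dense, nearby objects, which is the kind of camera-proximal bias the abstract says the remote simulator is meant to alleviate.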

Paper Structure (14 sections, 6 equations, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Comparison of the four proposed OV-3D top-down and bottom-up baselines.
  • Figure 2: The proposed Find n' Propagate approach. The framework aims to maximise the recall of novel objects through a Greedy Box Seeker and control the quality of newly identified boxes with a Greedy Box Oracle. To propagate the knowledge to distant areas, a Remote Propagator is applied, which allows diverse remote novel instances to be progressively captured.
  • Figure 3: Visualisation of open-vocabulary 3D detection results (§ Setting 1).