Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data

Rui Huang, Henry Zheng, Yan Wang, Zhuofan Xia, Marco Pavone, Gao Huang

TL;DR

OVM3D-Det is a novel open-vocabulary monocular 3D object detection framework that trains detectors using only RGB images, making it both cost-effective and scalable to publicly available data.

Abstract

Open-vocabulary 3D object detection, which aims to recognize novel classes in previously unseen domains, has recently attracted considerable attention due to its broad applications in autonomous driving and robotics. However, existing point cloud-based open-vocabulary 3D detection models are limited by their high deployment costs. In this work, we propose a novel open-vocabulary monocular 3D object detection framework, dubbed OVM3D-Det, which trains detectors using only RGB images, making it both cost-effective and scalable to publicly available data. Unlike traditional methods, OVM3D-Det does not require high-precision LiDAR or 3D sensor data, either as input or for generating 3D bounding boxes. Instead, it employs open-vocabulary 2D models and pseudo-LiDAR to automatically label 3D objects in RGB images, fostering the learning of open-vocabulary monocular 3D detectors. However, training 3D models with labels directly derived from pseudo-LiDAR is inadequate, because boxes estimated from noisy point clouds and severely occluded objects are imprecise. To address these issues, we introduce two innovative designs: adaptive pseudo-LiDAR erosion and bounding box refinement with prior knowledge from large language models. These techniques effectively calibrate the 3D labels and enable RGB-only training for 3D detectors. Extensive experiments demonstrate the superiority of OVM3D-Det over baselines in both indoor and outdoor scenarios. The code will be released.
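
The abstract relies on pseudo-LiDAR, that is, a point cloud obtained by lifting an estimated depth map into 3D. As a minimal sketch of that general idea (not the authors' implementation), the snippet below back-projects the pixels of one instance mask with a standard pinhole camera model. The function name `backproject_depth`, the nested-list inputs, and the intrinsics `fx, fy, cx, cy` are assumptions made for illustration; the depth estimator and the 2D model that would produce `depth` and `mask` are out of scope here.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_depth(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Lift masked pixels of a depth map to 3D camera-frame points.

    Hypothetical helper: uses the pinhole model
    x = (u - cx) * z / fx, y = (v - cy) * z / fy, with z the depth at (u, v).
    """
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            # Skip pixels outside the instance mask and invalid depths.
            if not keep or z <= 0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Toy 2x2 example: only the two diagonal pixels belong to the object.
toy_depth = [[2.0, 2.0], [2.0, 2.0]]
toy_mask = [[1, 0], [0, 1]]
toy_points = backproject_depth(toy_depth, toy_mask, fx=2.0, fy=2.0, cx=0.0, cy=0.0)
```

Because the depth values are estimates, the resulting points inherit their errors, which is why the abstract describes labels fitted directly to such points as imprecise.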

Paper Structure

This paper contains 41 sections, 2 equations, 8 figures, and 12 tables.

Figures (8)

  • Figure 1: Comparison between point cloud-based and image-based open-vocabulary 3D object detection methods. During training, point cloud-based approaches require corresponding point cloud and image data to derive pseudo labels, while image-based methods can leverage large-scale image data and the most advanced depth estimation models for pseudo-label generation. During inference, point cloud-based methods necessitate expensive LiDAR or other 3D sensors for deployment, whereas image-based approaches only require a camera.
  • Figure 2: Comparison between LiDAR data and pseudo-LiDAR. Although pseudo-LiDAR is much denser than LiDAR, it is highly noisy (as highlighted in the red boxes), making it inadequate for directly generating accurate 3D bounding boxes.
  • Figure 3: The overall framework of OVM3D-Det. Step ①: Generate per-instance pseudo-LiDAR. Step ②: Apply an adaptive erosion process to remove artifacts and noise. Step ③: Estimate the orientation. Step ④: Tightly fit a box and use object priors to assess the estimated box; if it is deemed unreasonable, search for the optimal box. Step ⑤: Train the model with pseudo labels.
  • Figure 4: Ray tracing loss and point ratio loss.
  • Figure 5: Qualitative results on SUN RGB-D and KITTI.
  • ...and 3 more figures
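
Steps ② and ④ in the Figure 3 caption can be illustrated with a rough sketch. The caption does not specify how the erosion adapts or how priors are compared, so everything below is an assumption: the erosion radius is taken to grow with the square root of the mask area, the box check is a simple relative tolerance on each dimension, and the names `adaptive_radius`, `erode`, and `is_plausible` as well as the example prior sizes are hypothetical.

```python
import math
from typing import List, Sequence


def adaptive_radius(mask: Sequence[Sequence[int]], ratio: float = 0.1,
                    max_radius: int = 5) -> int:
    """Pick an erosion radius from the mask size (assumed rule, not the paper's)."""
    area = sum(1 for row in mask for value in row if value)
    return min(max_radius, int(ratio * math.sqrt(area)))


def erode(mask: Sequence[Sequence[int]], radius: int) -> List[List[int]]:
    """Binary erosion with a square window; pixels outside the image count as background."""
    height, width = len(mask), len(mask[0])
    eroded = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # A pixel survives only if its whole neighbourhood lies inside the mask.
            eroded[r][c] = int(all(
                0 <= rr < height and 0 <= cc < width and mask[rr][cc]
                for rr in range(r - radius, r + radius + 1)
                for cc in range(c - radius, c + radius + 1)
            ))
    return eroded


def is_plausible(dims: Sequence[float], prior: Sequence[float],
                 tol: float = 0.5) -> bool:
    """Accept a fitted box if each dimension is within tol * prior of the prior size."""
    return all(abs(d - p) <= tol * p for d, p in zip(dims, prior))


# Toy mask: a fully covered 5x5 instance; edge pixels are the likeliest to be noisy.
full_mask = [[1] * 5 for _ in range(5)]
radius = adaptive_radius(full_mask, ratio=0.2)
core_mask = erode(full_mask, radius)

# Illustrative prior only (length, width, height in metres); not a value from the paper.
car_prior = (4.5, 1.8, 1.5)
box_ok = is_plausible((4.2, 1.7, 1.5), car_prior)
box_truncated = is_plausible((1.0, 1.7, 1.5), car_prior)
```

In this sketch a box that fails the check (such as the truncated one, as might result from heavy occlusion) would be passed on to the box search mentioned in step ④, which is not reproduced here.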