OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation

Zhenyu Wang, Yali Li, Taichi Liu, Hengshuang Zhao, Shengjin Wang

TL;DR

OV-Uni3DETR tackles the scarcity of annotated 3D data and modality fragmentation by proposing a unified multi-modal detector trained on point clouds, 3D detection images, and 2D detection images. It introduces cycle-modality propagation to transfer semantic knowledge from 2D to 3D and geometric knowledge from 3D to 2D, enabling open-vocabulary detection, test-time modality switching, and scene unification. The approach achieves state-of-the-art open-vocabulary performance across indoor and outdoor datasets and competitive closed-vocabulary results, with RGB-only inference rivaling point-cloud methods and multi-modal fusion providing additional gains. By unifying modalities, scenes, and vocabularies, this work moves toward universal 3D object detection, with practical implications for robotics, autonomous driving, and large-scale 3D understanding.

Abstract

In the current state of 3D object detection research, the severe scarcity of annotated 3D data, substantial disparities across different data modalities, and the absence of a unified architecture have impeded progress towards the goal of universality. In this paper, we propose OV-Uni3DETR, a unified open-vocabulary 3D detector via cycle-modality propagation. Compared with existing 3D detectors, OV-Uni3DETR offers distinct advantages: 1) Open-vocabulary 3D detection: During training, it leverages various accessible data, especially extensive 2D detection images, to boost training diversity. During inference, it can detect both seen and unseen classes. 2) Modality unifying: It seamlessly accommodates input data from any given modality, effectively addressing scenarios involving disparate modalities or missing sensor information, thereby supporting test-time modality switching. 3) Scene unifying: It provides a unified multi-modal model architecture for diverse scenes collected by distinct sensors. Specifically, we propose cycle-modality propagation, which propagates knowledge between the 2D and 3D modalities, to support the aforementioned functionalities. 2D semantic knowledge from large-vocabulary learning guides novel class discovery in the 3D domain, and 3D geometric knowledge provides localization supervision for 2D detection images. OV-Uni3DETR achieves state-of-the-art performance in various scenarios, surpassing existing methods by more than 6% on average. Its performance using only RGB images is on par with or even surpasses that of previous point-cloud-based methods. Code and pre-trained models will be released later.

Paper Structure

This paper contains 40 sections, 2 equations, 8 figures, 13 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Illustration of OV-Uni3DETR. (a): It uses various available data for training, including 3D point clouds, 3D detection images (annotated with 3D boxes and aligned with point clouds), and 2D detection images (annotated with 2D boxes only). At inference, it can predict 3D boxes from data of any modality, for both open-vocabulary and closed-vocabulary detection and for both indoor and outdoor scenes. (b): Compared with existing 3D detectors, OV-Uni3DETR achieves modality unifying (modality-switchable during inference), scene unifying, and open-vocabulary learning simultaneously.
  • Figure 2: Overview of OV-Uni3DETR. We extract features from point clouds and images. After being converted into the same voxel space, the features are added to form the multi-modal features. A 3D detection transformer then predicts classes and boxes. We propagate semantic knowledge from 2D to 3D for novel class discovery. To use 2D detection images, we predict the camera extrinsic parameters and propagate geometric knowledge from 3D to 2D through a class-agnostic (CA) 3D detector.
  • Figure 3: Illustration of knowledge propagation from 3D to 2D. A class-agnostic 3D detector is first trained on 3D detection images and is then used, together with the predicted camera parameters, to generate class-agnostic 3D bounding boxes for the 2D detection images. Finally, Hungarian matching between the 2D boxes and the 3D boxes yields class-specific 3D bounding boxes.
  • Figure 4: Visualization of OV-Uni3DETR for open-vocabulary 3D detection on the SUN RGB-D (first and second), ScanNet (third), and KITTI (fourth) datasets. Red boxes denote base classes and blue boxes denote novel classes.
  • Figure 5: Visualization of OV-Uni3DETR for open-vocabulary 3D detection on SUN RGB-D. Red boxes denote base classes and blue boxes denote novel classes.
  • ...and 3 more figures
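
The matching step described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation, and several details are assumptions made only for illustration: the class-agnostic 3D boxes are taken to be already projected into the image plane as 2D rectangles, the matching cost is assumed to be 1 - IoU, and the helper names (`iou_2d`, `match_boxes`) and the toy boxes and labels are hypothetical. To stay dependency-free, the optimal one-to-one assignment is found by exhaustive search, which reaches the same optimum as the Hungarian algorithm on small inputs; in practice a solver such as `scipy.optimize.linear_sum_assignment` would likely be used instead.

```python
from itertools import permutations


def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_boxes(boxes_2d, projected_3d):
    """Return {2D index: 3D index} minimizing the total (1 - IoU) cost.

    Assumption: every annotated 2D box receives exactly one 3D box, so
    there must be at least as many 3D candidates as 2D boxes.
    Exhaustive search stands in for the Hungarian algorithm here.
    """
    n, m = len(boxes_2d), len(projected_3d)
    if n > m:
        raise ValueError("need at least as many 3D candidates as 2D boxes")
    cost = [[1.0 - iou_2d(b2, b3) for b3 in projected_3d] for b2 in boxes_2d]
    best, best_cost = (), float("inf")
    for perm in permutations(range(m), n):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return dict(enumerate(best))


# Toy data (hypothetical): two labeled 2D boxes, three class-agnostic
# 3D boxes already projected into the image plane.
boxes_2d = [(10, 10, 50, 50), (60, 20, 100, 80)]
labels_2d = ["chair", "table"]
projected_3d = [(58, 22, 98, 78), (200, 200, 240, 240), (12, 8, 52, 48)]

assignment = match_boxes(boxes_2d, projected_3d)
# Each matched 3D box inherits the class label of its 2D box.
class_specific = {j: labels_2d[i] for i, j in assignment.items()}
print(assignment)
print(class_specific)
```

In this toy case the unmatched 3D candidate (index 1) simply receives no label, which mirrors the idea that only 3D boxes paired with an annotated 2D box become class-specific.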