Towards Unified 3D Object Detection via Algorithm and Data Unification

Zhuoling Li; Xiaogang Xu; SerNam Lim; Hengshuang Zhao

TL;DR

This work builds MM-Omni3D, the first unified multi-modal 3D object detection benchmark, and extends the monocular detector UniMODE to a multi-modal version, MM-UniMODE, which is the first unified multi-modal 3D object detector.

Abstract

Realizing unified 3D object detection, covering both indoor and outdoor scenes, is of great importance in applications such as robot navigation. However, training models on data from such varied scenarios is challenging because their characteristics differ significantly, e.g., diverse geometry properties and heterogeneous domain distributions. In this work, we address these challenges from two perspectives: the algorithm perspective and the data perspective. From the algorithm perspective, we first build a monocular 3D object detector based on the bird's-eye-view (BEV) detection paradigm, where explicit feature projection helps resolve the ambiguity in geometry learning. In this detector, we split the classical BEV detection architecture into two stages and propose an uneven BEV grid design to handle the convergence instability caused by geometry differences between scenarios. In addition, we develop a sparse BEV feature projection strategy to reduce the computational cost and a unified domain alignment method to handle heterogeneous domains. From the data perspective, we propose to incorporate depth information to improve training robustness. Specifically, we build the first unified multi-modal 3D object detection benchmark, MM-Omni3D, and extend the aforementioned monocular detector to its multi-modal version, which is the first unified multi-modal 3D object detector. We name the designed monocular and multi-modal detectors UniMODE and MM-UniMODE, respectively. The experimental results reveal several insightful findings that highlight the benefits of multi-modal data and confirm the effectiveness of all the proposed strategies.
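
To make the uneven BEV grid idea more concrete, the following is a minimal sketch and not the UniMODE implementation. It assumes that "uneven" means depth cells that are narrow near the camera and wider far away, so that close indoor objects and distant outdoor objects can share one grid. The function names, the geometric growth rule, and the `growth` parameter are all hypothetical choices made for illustration.

```python
import numpy as np


def uneven_bev_edges(max_range: float, num_cells: int, growth: float = 1.05) -> np.ndarray:
    """Return num_cells + 1 cell edges covering [0, max_range].

    Assumption for illustration: cell widths grow geometrically with
    distance from the camera, then get rescaled to span max_range.
    """
    widths = growth ** np.arange(num_cells)
    widths = widths / widths.sum() * max_range
    return np.concatenate(([0.0], np.cumsum(widths)))


def cell_index(depth, edges: np.ndarray):
    """Map depth values (metres) to the index of the cell that contains them."""
    idx = np.searchsorted(edges, depth, side="right") - 1
    # Clamp so that depths on or beyond the outer edge fall in the last cell.
    return np.clip(idx, 0, len(edges) - 2)


edges = uneven_bev_edges(max_range=50.0, num_cells=64)
near_width = edges[1] - edges[0]
far_width = edges[-1] - edges[-2]
```

With these hypothetical settings, `near_width` is much smaller than `far_width`, which is the property the sketch is meant to show: fine resolution where indoor targets cluster and coarse resolution at outdoor ranges.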

Paper Structure (21 sections, 7 equations, 8 figures, 11 tables)

Introduction
Related Work
Unified Monocular Detection
UniMODE Overall Framework
Two-Stage Detection Architecture
Uneven BEV Grid
Sparse BEV Feature Projection
Unified Domain Alignment
Unified Multi-modal Detection
MM-Omni3D Benchmark
MM-Omni3D Data Statistics
MM-UniMODE Overall Framework
Mutual Information Collaboration
Experiment
Monocular Performance Comparison
...and 6 more sections

Figures (8)

Figure 1: Sub-figures (a)$\sim$(d): Challenges of unified 3D object detection. (1) Comparing sub-figures (a) and (b), indoor objects are small and close, while outdoor objects are far and sparse. Besides, the camera parameters are highly varying. (2) Comparing sub-figures (a), (b), and (c), which correspond to a real-world indoor image, a real-world outdoor image, and a synthetic indoor image, the image styles are different. (3) Although the category "Picture" is labeled in sub-figure (c), it is not labeled in sub-figure (d), which suggests label conflict among different sub-datasets. Unlabeled objects are highlighted by red ellipses. Sub-figures (e)$\sim$(j): Illustrations of the MM-Omni3D benchmark, which showcase both the 3D box annotations and point clouds. The sub-figures clearly demonstrate the significant point cloud differences between different scenarios due to depth sensor discrepancies.
Figure 2: The overall detection framework of UniMODE. The illustrated modules proposed in this work include the proposal head, sparse BEV feature projection, uneven BEV feature grid, domain adaptive layer normalization, and class alignment loss.
Figure 3: Indoor and outdoor target position distributions in the BEV space. The brighter a point shows, the more targets the corresponding BEV grid contains. The perception camera is located at the point with the coordinate $(0, 0)$.
Figure 4: An example of heterogeneous label conflict among sub-datasets in Omni3D. As shown, "Window" is not labeled in ARKitScenes but is labeled in Hypersim, so the unlabeled window in (a) could harm the convergence stability of detectors.
Figure 5: (a) Camera-view image. (b) Camera-view depth map. (c)$\sim$(f) The processed point clouds observed from the front view, top view, back view, and side view, respectively. As shown, unlike previous 3D object detection datasets [song2015sun] that store the point cloud of the whole scene, MM-Omni3D only provides the points visible from the camera view, which better matches the practical situation of online multi-modal 3D object detection.
...and 3 more figures
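
The Figure 5 caption describes point clouds that contain only the points visible from the camera view. One standard way to obtain such a point cloud from a camera-view depth map is pinhole back-projection, sketched below. This is an illustrative assumption about the general technique, not the MM-Omni3D processing pipeline; the function name `depth_to_points`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the convention that a depth of 0 marks a missing measurement are all hypothetical.

```python
import numpy as np


def depth_to_points(depth: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float) -> np.ndarray:
    """Back-project an (H, W) depth map into an (N, 3) camera-frame point cloud.

    Assumptions for illustration: a pinhole camera model, depth in metres
    along the optical axis, and depth <= 0 meaning "no measurement".
    """
    h, w = depth.shape
    # u indexes columns (x direction), v indexes rows (y direction).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)


# Tiny usage example: a 2x3 depth map with one missing pixel.
depth = np.full((2, 3), 2.0)
depth[0, 0] = 0.0
points = depth_to_points(depth, fx=1.0, fy=1.0, cx=1.0, cy=0.5)
```

Because every point comes from a pixel of the depth map, the result contains only surfaces seen by the camera, which is the property the caption contrasts with whole-scene point clouds.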
