CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection

Zhaonian Kuang; Rui Ding; Haotian Wang; Xinhu Zheng; Meng Yang; Gang Hua

TL;DR

CoIn3D is a generalizable MC3D framework that transfers from source camera configurations to unseen target ones. It explicitly incorporates all identified spatial priors into feature embedding, through spatial-aware feature modulation (SFM), and into image observation, through camera-aware data augmentation (CDA).

Abstract

Multi-camera 3D object detection (MC3D) has attracted increasing attention with the growing deployment of multi-sensor physical agents, such as robots and autonomous vehicles. However, MC3D models still struggle to generalize to unseen platforms with new multi-camera configurations. Current solutions simply employ a meta-camera for unified representation but do not comprehensively account for how configurations differ. In this paper, we revisit this issue and identify that the devil lies in spatial prior discrepancies between source and target configurations, including different intrinsics, extrinsics, and array layouts. To address this, we propose CoIn3D, a generalizable MC3D framework that enables strong transferability from source configurations to unseen target ones. CoIn3D explicitly incorporates all identified spatial priors into both feature embedding and image observation through spatial-aware feature modulation (SFM) and camera-aware data augmentation (CDA), respectively. SFM enriches the feature space by integrating four spatial representations: focal length, ground depth, ground gradient, and Plücker coordinates. CDA improves observation diversity under various configurations via a training-free dynamic novel-view image synthesis scheme. Extensive experiments demonstrate that CoIn3D achieves strong cross-configuration performance on landmark datasets such as nuScenes, Waymo, and Lyft, under three dominant MC3D paradigms represented by BEVDepth, BEVFormer, and PETR.
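
To make the Plücker representation mentioned above concrete, here is a minimal sketch of how a per-pixel Plücker ray (direction plus moment) could be computed from a camera's intrinsics and extrinsics. It assumes an undistorted pinhole camera and a camera-to-ego pose; the function name `pixel_to_plucker` and its argument layout are hypothetical, and the paper's exact raymap construction is not given in this summary and may differ.

```python
import math


def pixel_to_plucker(u, v, fx, fy, cx, cy, R, t):
    """Return (direction, moment) of the viewing ray through pixel (u, v).

    Assumptions (not taken from the paper): pinhole model without distortion,
    R is a 3x3 camera-to-ego rotation given as a list of rows, and t is the
    camera center expressed in the ego frame.
    """
    # Back-project the pixel to a direction in the camera frame.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate the direction into the ego frame.
    d = [sum(R[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    # Normalize so the direction has unit length.
    norm = math.sqrt(sum(x * x for x in d))
    d = [x / norm for x in d]
    # Moment m = o x d, where o is the camera center in the ego frame.
    m = [
        t[1] * d[2] - t[2] * d[1],
        t[2] * d[0] - t[0] * d[2],
        t[0] * d[1] - t[1] * d[0],
    ]
    return d, m


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Ray through the principal point of a camera shifted 1 m along x.
    print(pixel_to_plucker(800, 450, 1000.0, 1000.0, 800.0, 450.0,
                           identity, (1.0, 0.0, 0.0)))
```

Evaluating this for every pixel yields a six-channel map. Because the moment depends on the camera center, two cameras with identical intrinsics but different mounting positions produce different maps, which is what lets such a representation encode extrinsics and array layout.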

Paper Structure (43 sections, 7 equations, 9 figures, 13 tables)

This paper contains 43 sections, 7 equations, 9 figures, 13 tables.

Introduction
Related Work
Multi-Camera 3D Object Detection
Camera Configuration Generalization
3D Gaussian splatting
Revisit cameras configuration in MC3D
MC3D task
Intrinsic revisiting
Extrinsic revisiting
Array revisiting
Methodology
Spatial-aware feature modulation
Inverse focal map
Ground depth and gradient map
Plücker raymap
...and 28 more sections

Figures (9)

Figure 1: CoIn3D effectively enables model transferability from source configuration A to unseen target configurations B, C, ..., covering variations in intrinsics, extrinsics, and array layouts. Our framework can be applied to three dominant MC3D paradigms, represented by BEVDepth [li2023bevdepth], BEVFormer [li2024bevformer], and PETR [liu2022petr].
Figure 2: Illustration of our CoIn3D framework for generalizable MC3D across multi-camera configurations. During training, we apply the camera-aware data augmentation (CDA) to generate $N$ images with randomly sampled camera configurations, followed by spatial-aware feature modulation (SFM). SFM modulates activations using an inverse focal map to obtain focal-invariant features, then projects prior maps (ground depth, gradient map, and Plücker raymap) to create spatial embeddings, which are added to the focal-invariant features. These maps are concatenated with image input and features to provide raw priors. Finally, the spatial-aware features can be easily integrated into MC3D for downstream tasks. During inference, we use raw images and apply spatial-aware modulation to generalize to new camera configurations. Our framework is applicable to dominant MC3D paradigms, including bottom-up BEV, top-down BEV, and sparse-queries.
Figure 3: Illustration of the spatial discrepancies under different camera configurations: (a) focal ambiguity for the same object; (b) ground depth and rate of depth increase under different camera heights; (c) the scene structure (1st row), depth distribution (2nd row), and Plücker raymap (3rd row) for surround-view cameras.
Figure 4: Illustration of the training-free ego-centric Gaussian construction pipeline. We transform reconstructed textured point clouds into Gaussian representations using predefined parameters.
Figure 5: The detailed ego-centric Gaussian construction pipeline.
...and 4 more figures
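
The discrepancies named in the Figure 3 caption, focal ambiguity in (a) and ground depth under different camera heights in (b), follow from basic pinhole geometry. The sketch below works through both under simplifying assumptions that are not taken from the paper: a flat ground plane, a level optical axis, and image rows increasing downward. All function names are hypothetical.

```python
def projected_height(f, object_height, depth):
    """Pixel height of an object of metric height `object_height` at `depth`."""
    return f * object_height / depth


def ground_depth(v, fy, cy, cam_height):
    """Depth of the flat-ground point seen at image row v.

    Returns None for rows at or above the horizon (v <= cy), where the
    viewing ray never meets the ground.
    """
    if v <= cy:
        return None
    return fy * cam_height / (v - cy)


def ground_depth_gradient(v, fy, cy, cam_height):
    """Derivative of ground depth with respect to image row v."""
    if v <= cy:
        return None
    return -fy * cam_height / (v - cy) ** 2


if __name__ == "__main__":
    # Focal ambiguity: doubling both focal length and depth gives the same
    # pixel height, so appearance alone cannot separate the two cases.
    print(projected_height(1000.0, 1.5, 15.0), projected_height(2000.0, 1.5, 30.0))
    # Camera height: the same image row maps to different ground depths.
    print(ground_depth(600, 1000.0, 500.0, 1.5), ground_depth(600, 1000.0, 500.0, 3.0))
```

Under these assumptions, ground depth at a given row scales linearly with camera height and focal length, so a depth prior learned on one platform is systematically off on another unless these quantities are supplied to the model.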
