MV-FCOS3D++: Multi-View Camera-Only 4D Object Detection with Pretrained Monocular Backbones

Tai Wang, Qing Lian, Chenming Zhu, Xinge Zhu, Wenwei Zhang

TL;DR

The paper addresses camera-only 3D object detection by introducing MV-FCOS3D++, a framework that uses a predefined 3D voxel grid to unify multi-view monocular features and applies BEV-based 3D detection. It strengthens the 2D backbone through perspective-view pretraining and leverages a dual-path temporal fusion to incorporate multi-frame cues, including a stereo-aware pathway. Empirical results on the Waymo camera-only track show substantial improvements over baselines, achieving 49.75% mAPL with a single model and placing second without LiDAR depth supervision. The work demonstrates the effectiveness of explicit 3D voxel representations and dual-path temporal fusion for robust camera-only 4D object detection, with code to be released for reproducibility.

Abstract

In this technical report, we present our solution, dubbed MV-FCOS3D++, for the Camera-Only 3D Detection track in Waymo Open Dataset Challenge 2022. For multi-view camera-only 3D detection, methods based on bird's-eye-view or 3D geometric representations can leverage the stereo cues from overlapping regions between adjacent views and directly perform 3D detection without hand-crafted post-processing. However, such methods lack direct semantic supervision for their 2D backbones, which can be compensated for by pretraining simple monocular detectors. Our solution is a multi-view framework for 4D detection following this paradigm. It is built upon a simple monocular detector, FCOS3D++, pretrained only with the object annotations of Waymo, and converts multi-view features to a 3D grid space in which 3D objects are detected. A dual-path neck for single-frame understanding and temporal stereo matching is devised to incorporate multi-frame information. Our method achieves 49.75% mAPL with a single model and wins 2nd place in the WOD challenge, without any LiDAR-based depth supervision during training. The code will be released at https://github.com/Tai-Wang/Depth-from-Motion.
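
To make the "multi-view features to a 3D grid space" step more concrete, the following is a minimal sketch of one common way to do such a lifting: project each voxel center into every camera, sample the 2D feature at that pixel, and average over the views that see the voxel. It is not taken from the released code. The function name lift_to_voxels, the pinhole intrinsics, the 4x4 world-to-camera matrices, nearest-neighbor sampling, and mean fusion are all assumptions made for illustration; the actual MV-FCOS3D++ implementation may differ in each of these choices.

```python
import numpy as np


def lift_to_voxels(feats, intrinsics, world_to_cam, voxel_centers):
    """Hypothetical helper: gather multi-view 2D features onto 3D voxel centers.

    feats:         (V, C, H, W) per-view feature maps
    intrinsics:    (V, 3, 3) pinhole matrices with last row [0, 0, 1] (assumed)
    world_to_cam:  (V, 4, 4) rigid transforms from the grid frame to each camera
    voxel_centers: (N, 3) voxel center coordinates in the grid frame
    Returns (N, C) averaged features and (N,) number of views seeing each voxel.
    """
    num_views, channels, height, width = feats.shape
    num_voxels = voxel_centers.shape[0]
    acc = np.zeros((num_voxels, channels), dtype=np.float64)
    hits = np.zeros(num_voxels, dtype=np.int64)
    homo = np.concatenate([voxel_centers, np.ones((num_voxels, 1))], axis=1)

    for v in range(num_views):
        # Voxel centers expressed in the camera frame of view v.
        cam = (world_to_cam[v] @ homo.T).T[:, :3]
        depth = cam[:, 2]
        in_front = depth > 1e-6
        safe_depth = np.where(in_front, depth, 1.0)  # avoid dividing by <= 0

        # Perspective projection to pixel coordinates (nearest neighbor).
        pix = (intrinsics[v] @ cam.T).T
        cols = np.round(pix[:, 0] / safe_depth).astype(np.int64)
        rows = np.round(pix[:, 1] / safe_depth).astype(np.int64)

        valid = (
            in_front
            & (cols >= 0) & (cols < width)
            & (rows >= 0) & (rows < height)
        )
        # feats[v][:, rows, cols] has shape (C, M); transpose to (M, C).
        acc[valid] += feats[v][:, rows[valid], cols[valid]].T
        hits += valid

    # Mean over the views that observe each voxel; unseen voxels stay zero.
    fused = acc / np.maximum(hits, 1)[:, None]
    return fused, hits
```

Voxels that fall in the overlap between adjacent cameras receive features from more than one view, which is where a 3D grid representation can pick up the stereo cues mentioned in the abstract. A BEV detector would then operate on the resulting grid, for example after collapsing the height axis.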


Paper Structure

This paper contains 9 sections, 5 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: An overview of our framework.
  • Figure 2: Our dual-path design for temporal modeling.