Every Dataset Counts: Scaling up Monocular 3D Object Detection with Joint Datasets Training

Fulong Ma, Xiaoyang Yan, Guoyang Zhao, Xiaojie Xu, Yuxuan Liu, Jun Ma, Ming Liu

TL;DR

This work tackles the expense and scarcity of 3D annotations by enabling scalable monocular 3D object detection through joint dataset training and 2D-label supervision. It introduces a camera-aware MonoFlex-based baseline, a selective training scheme for heterogeneous datasets, and a pseudo 3D training pipeline that derives 3D supervision from 2D labels, enabling fine-tuning on targets lacking 3D data. The approach yields strong generalization across KITTI, Cityscapes, and other public datasets, outperforming zero-shot baselines and approaching weakly supervised methods while avoiding LiDAR data. The framework promises practical impact for deploying monocular 3D detectors in new environments with minimal annotation cost, validated on a broad suite of autonomous-driving datasets.

Abstract

Monocular 3D object detection plays a crucial role in autonomous driving. However, existing monocular 3D detection algorithms depend on 3D labels derived from LiDAR measurements, which are costly to acquire for new datasets and make the algorithms challenging to deploy in novel environments. This study investigates a pipeline for training a monocular 3D object detection model on a diverse collection of 3D and 2D datasets. The proposed framework comprises three components: (1) a robust monocular 3D model capable of functioning across various camera settings, (2) a selective-training strategy to accommodate datasets with differing class annotations, and (3) a pseudo 3D training approach using 2D labels to enhance detection performance in scenes containing only 2D labels. With this framework, we can train models on a joint set of various open 3D/2D datasets to obtain models with significantly stronger generalization capability and enhanced performance on new datasets with only 2D labels. We conduct extensive experiments on the KITTI, nuScenes, ONCE, Cityscapes, and BDD100K datasets to demonstrate the scaling ability of the proposed method.
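The selective-training idea in component (2) can be illustrated with a small sketch. The abstract does not give the loss formulation, so everything below is an assumption made for illustration: the class vocabulary, the dataset names `dataset_a` and `dataset_b`, the helper names, and the use of plain binary cross-entropy on per-class heatmaps. The point is only that classes a dataset does not annotate are masked out of the loss, so a missing label is not treated as a negative example.

```python
import numpy as np

# Hypothetical class vocabulary and per-dataset annotation coverage (illustrative only).
CLASSES = ["car", "pedestrian", "cyclist", "truck"]
ANNOTATED = {
    "dataset_a": {"car", "pedestrian", "cyclist"},
    "dataset_b": {"car", "truck"},
}


def class_mask(dataset_name):
    """Return 1.0 for classes the dataset annotates and 0.0 for the rest."""
    labeled = ANNOTATED[dataset_name]
    return np.array([1.0 if c in labeled else 0.0 for c in CLASSES])


def selective_bce(pred, target, dataset_name, eps=1e-7):
    """Binary cross-entropy over per-class heatmaps of shape (C, H, W),
    averaged only over the class channels the source dataset annotates.
    Plain BCE is a stand-in here, not the loss used in the paper."""
    pred = np.clip(pred, eps, 1.0 - eps)
    per_pixel = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    per_class = per_pixel.mean(axis=(1, 2))  # one loss value per class channel
    mask = class_mask(dataset_name)
    return float((per_class * mask).sum() / max(mask.sum(), 1.0))
```

With this masking, a confident "truck" prediction on an image from `dataset_a` contributes nothing to the loss, because `dataset_a` carries no truck labels that could confirm or refute it.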

Paper Structure (16 sections, 5 equations, 5 figures, 5 tables, 1 algorithm)

Figures (5)

  • Figure 1: Our method mainly consists of three parts. The first part is the camera parameter adaptation module, which handles different camera parameters to mitigate their impact. The second part is multi-dataset joint training, where we pre-train the model using as many datasets as possible to enhance its feature extraction capability. The third part involves leveraging 2D annotation information to assist the training of the 3D detection model, enabling good detection performance even in the absence of 3D annotation information.
  • Figure 2: Visualizations of the detection results of our method on five different datasets: BDD100K, Cityscapes, KITTI, nuScenes, and ONCE.
  • Figure 3: This figure illustrates the training process of our proposed method. It shows how the pre-trained model's inference, combined with the 2D annotations from the dataset, facilitates the training of a 3D detection model on datasets that lack 3D training labels.
  • Figure 4: The figure depicts the training label update process. In the left image, the pre-trained 3D detection model's predictions on new data are shown, which may include some erroneous detections, as indicated by the green boxes. The middle image illustrates the process of identifying and filtering out these erroneous detections, marking them in gray based on the matching results. The right image represents the reconstruction of the ground truth heatmap using the pseudo 3D labels.
  • Figure 5: The qualitative results on the Cityscapes dataset. The leftmost column contains the original images, the middle column displays the zero-shot results, and the rightmost column shows the results obtained using our method. The pink boxes represent 2D detection results.
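
The label-update step described in the Figure 3 and Figure 4 captions can be sketched as follows. The captions only say that predictions are filtered "based on the matching results", so the matching rule here is an assumption: greedy one-to-one matching by 2D IoU with a threshold of 0.5. The prediction fields (`box2d`, `score`, `box3d`) and the function names are hypothetical as well.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_pseudo_labels(predictions, gt_boxes_2d, iou_thresh=0.5):
    """Split pre-trained-model predictions into kept and rejected lists.

    predictions: list of dicts with 'box2d' (projected 2D box), 'score',
    and 'box3d' (any payload describing the 3D box).
    gt_boxes_2d: list of annotated 2D boxes for the same image.
    A prediction is kept only if it claims a still-unmatched 2D annotation
    with IoU >= iou_thresh (assumed rule, not taken from the paper).
    """
    kept, rejected = [], []
    used = set()
    # Visit high-confidence predictions first so they get first pick.
    for pred in sorted(predictions, key=lambda p: p["score"], reverse=True):
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(gt_boxes_2d):
            if j in used:
                continue
            overlap = iou(pred["box2d"], gt)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0 and best_iou >= iou_thresh:
            used.add(best_j)
            kept.append(pred)      # becomes a pseudo 3D label
        else:
            rejected.append(pred)  # treated as an erroneous detection
    return kept, rejected
```

The `kept` predictions would then serve as pseudo 3D labels when rebuilding the ground-truth heatmap, while the `rejected` ones correspond to the gray boxes in the middle panel of Figure 4.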