MS3D++: Ensemble of Experts for Multi-Source Unsupervised Domain Adaptation in 3D Object Detection

Darren Tsai; Julie Stephany Berrio; Mao Shan; Eduardo Nebot; Stewart Worrall

TL;DR

The paper tackles the domain gap in 3D object detection by introducing MS3D++, a multi-source self-training framework that ensembles pre-trained detectors from diverse source domains, leverages short-sequence point cloud accumulation, and employs temporal refinement to generate high-quality pseudo-labels. Central to MS3D++ are Kernel Density Estimation Box Fusion (KBF) for robust fusion, Varied Multi-Frame Inference (VMFI) to broaden proposal coverage, detector weighting to counteract cross-domain biases, and multi-stage self-training to progressively improve recall while preserving precision. Experimental results on Waymo, nuScenes, and Lyft show that detectors trained on MS3D++ pseudo-labels reach state-of-the-art BEV performance, comparable to training with human-annotated labels, along with competitive 3D detection performance. This demonstrates practical domain adaptation without manual labeling or architecture changes. The framework is designed to be flexible and extensible, enabling easy integration with existing detectors and data augmentations, with future directions including multi-modal detectors and active learning.

Abstract

Deploying 3D detectors in unfamiliar domains has been demonstrated to result in a significant 70-90% drop in detection rate due to variations in lidar, geography, or weather from their training dataset. This domain gap leads to missing detections for densely observed objects, misaligned confidence scores, and increased high-confidence false positives, rendering the detector highly unreliable. To address this, we introduce MS3D++, a self-training framework for multi-source unsupervised domain adaptation in 3D object detection. MS3D++ generates high-quality pseudo-labels, allowing 3D detectors to achieve high performance on a range of lidar types, regardless of their density. Our approach effectively fuses predictions of an ensemble of multi-frame pre-trained detectors from different source domains to improve domain generalization. We subsequently refine predictions temporally to ensure temporal consistency in box localization and object classification. Furthermore, we present an in-depth study into the performance and idiosyncrasies of various 3D detector components in a cross-domain context, providing valuable insights for improved cross-domain detector ensembling. Experimental results on Waymo, nuScenes and Lyft demonstrate that detectors trained with MS3D++ pseudo-labels achieve state-of-the-art performance, comparable to training with human-annotated labels in Bird's Eye View (BEV) evaluation for both low and high density lidar. Code is available at https://github.com/darrenjkt/MS3D
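To make the idea of kernel-density-based box fusion (KBF, named in the TL;DR) more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that overlapping boxes from different detectors have already been grouped into one cluster, and that each box parameter is fused independently by taking the peak of a confidence-weighted Gaussian kernel density estimate. The function names `kde_peak` and `fuse_boxes`, the dictionary box layout, the bandwidth rule, and the grid search are all illustrative assumptions. Heading is left out because angles wrap around and would need circular handling.

```python
import math


def kde_peak(values, weights=None, bandwidth=None, grid_size=200):
    """Return the grid point with the highest weighted Gaussian KDE density."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    if weights is None:
        weights = [1.0] * n
    lo, hi = min(values), max(values)
    if hi == lo:
        # All detectors agree exactly, so there is nothing to fuse.
        return lo
    if bandwidth is None:
        # Rule-of-thumb bandwidth (an assumption, not taken from the paper).
        mean = sum(values) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
        bandwidth = 1.06 * std * n ** (-0.2)
    best_x, best_density = lo, -1.0
    for i in range(grid_size + 1):
        x = lo + (hi - lo) * i / grid_size
        # Unnormalised density: only the location of the maximum matters.
        density = sum(
            w * math.exp(-0.5 * ((x - v) / bandwidth) ** 2)
            for v, w in zip(values, weights)
        )
        if density > best_density:
            best_x, best_density = x, density
    return best_x


def fuse_boxes(boxes):
    """Fuse one cluster of boxes, given as dicts with x, y, z, l, w, h, score."""
    scores = [b["score"] for b in boxes]
    fused = {
        key: kde_peak([b[key] for b in boxes], scores)
        for key in ("x", "y", "z", "l", "w", "h")
    }
    fused["score"] = kde_peak(scores, scores)
    return fused


# Hypothetical cluster: two detectors agree, a third is a low-confidence outlier.
cluster = [
    {"x": 10.0, "y": 2.0, "z": 0.5, "l": 4.5, "w": 1.9, "h": 1.6, "score": 0.9},
    {"x": 10.1, "y": 2.0, "z": 0.5, "l": 4.5, "w": 1.9, "h": 1.6, "score": 0.8},
    {"x": 12.0, "y": 2.0, "z": 0.5, "l": 4.5, "w": 1.9, "h": 1.6, "score": 0.3},
]
fused_box = fuse_boxes(cluster)
```

In this toy cluster the fused `x` stays near the two agreeing detections instead of being pulled toward the outlier as a plain average would be, which is the intuition behind using a density peak for fusion.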

Paper Structure (39 sections, 1 equation, 8 figures, 15 tables)

This paper contains 39 sections, 1 equation, 8 figures, 15 tables.

Introduction
Related Work
3D Object Detection
Point cloud representations
Detection head
Unsupervised Domain Adaptation
Quantifying Cross-Domain 3D object detection
Experiment Setup
3D Detector Selection
Dataset Selection and Evaluation
Training with Accumulated Point Clouds
Varying Multi-frame Inference
Cross-Domain Results for Each Target Domain
MS3D++
Overview
...and 24 more sections

Figures (8)

Figure 1: Domain generalization for unseen target domains can be improved by ensembling multiple pre-trained detectors from various source domains. The figure shows detector predictions on a point cloud from the Waymo dataset [sun2020waymo] in rainy weather. Top: PV-RCNN++ [shi2023pv++] trained on Lyft [woven2019lyft]. Middle: Ensemble of PV-RCNN++ and VoxelRCNN [deng2020voxelrcnn] trained on both Lyft and nuScenes [caesar2020nuscenes]. Bottom: Fused detections with our Kernel Density Estimation Box Fusion (KBF).
Figure 2: Our multi-source self-training framework, MS3D++. Given a set of $M$ pre-trained, multi-frame 3D detectors from multiple source domains $\mathbf{D}_{S,m}$, where $m = 1, 2, \ldots, M$, we generate predictions for a varying number of accumulated point cloud frames, which are fused with KBF and tracked. Our temporal refinement uses object characteristics to ascertain class labels and refine bounding box localization. The set of pseudo-labels is iteratively improved with each update, utilizing the re-trained detectors from the previous round in its ensemble.
Figure 3: 3D detectors tested on cross-domain data encounter various issues. From left to right: (1) Scan pattern discrepancy causes densely observed vehicles and pedestrians to be missed. (2) Weather artefacts, such as reflections on the ground due to rain, appear different across scan patterns, causing missed detections or false positives. (3) Sparse lidar makes box height estimation challenging. (4) Poor (or missing) motion compensation for the ego-vehicle causes artefacts, inaccurate labels and, therefore, prediction "errors"; in the image, the upper predicted box may not meet the IoU=0.7 threshold for a true positive detection. Predictions for images (1-4) were made with PV-RCNN++ (centerhead) trained on nuScenes (1), Lyft (2) and Waymo (3-4). We elaborate on these in the section on cross-domain results.
Figure 4: Ensembling with Varied Multi-frame Inference (VMFI) obtains an extended range of detections compared to using 16 frames for inference. This could be attributed to the exacerbated scan pattern domain gap when feeding 16 frames to a pre-trained multi-frame detector from a different source domain.
Figure 5: Multi-stage training process for MS3D++: In the first round of self-training, we employ a range of detectors from various source domains. The pseudo-labels are used to re-train the 3D detectors with multi-frame accumulation. We use the re-trained detectors to create a new ensemble for the next round.
...and 3 more figures
