CMDA: Cross-Modal and Domain Adversarial Adaptation for LiDAR-Based 3D Object Detection

Gyusam Chang, Wonseok Roh, Sujin Jang, Dongwook Lee, Daehyun Ji, Gyeongrok Oh, Jinsun Park, Jinkyu Kim, Sangpil Kim

TL;DR

The paper tackles unsupervised domain adaptation (UDA) for LiDAR-based 3D object detection, where the target domain is unlabeled. It introduces CMDA, which combines Cross-Modality Knowledge Interaction (CMKI) with a Cross-Domain Adversarial Network (CDAN) to learn domain-invariant BEV features. CMKI transfers semantic cues from camera images to the LiDAR BEV representation, and CDAN applies adversarial self-training with point mix-up and entropy regularization; the corresponding objectives are $\mathcal{L}_{cmki}$, $\mathcal{L}_{d}$, and $\mathcal{L}_{ent}$. Key contributions include the BEV-aligned cross-modal learning objective $\mathcal{L}_{cmki}$, the cross-domain mix-up strategy, the gradient-reversal discriminator trained with $\mathcal{L}_{d}$ and $\mathcal{L}_{ent}$, and experiments on nuScenes, Waymo, and KITTI that report state-of-the-art UDA performance. The work aims to make 3D perception more robust to distribution shift, reducing reliance on target labels and easing deployment of autonomous systems across diverse sensing conditions and geographies.
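
The adversarial part of the summary above can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: it assumes a generic binary cross-entropy for the domain loss and a generic binary entropy term, since the exact forms, signs, and weights of $\mathcal{L}_{d}$ and $\mathcal{L}_{ent}$ are not given in this summary. The 1-D model and the names `sigmoid`, `bce`, `binary_entropy`, `grl_gradients`, and `lam` are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one prediction p in (0, 1) and label y in {0, 1}.

    Assumed stand-in for a domain-discriminator loss; the paper's exact
    form of L_d is not stated in this summary.
    """
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def binary_entropy(p, eps=1e-12):
    """Entropy of a Bernoulli prediction; largest at p = 0.5.

    Assumed stand-in for an entropy regularizer such as L_ent.
    """
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def grl_gradients(w, v, x, y, lam=1.0):
    """Toy gradient reversal on a 1-D model (hypothetical, for illustration).

    Encoder:        f = w * x
    Discriminator:  p = sigmoid(v * f)
    Returns (loss, grad_v, grad_w), where grad_w has passed through a
    gradient reversal layer and is therefore scaled by -lam.
    """
    f = w * x
    p = sigmoid(v * f)
    loss = bce(p, y)
    dloss_dlogit = p - y            # derivative of sigmoid + BCE w.r.t. the logit
    grad_v = dloss_dlogit * f       # discriminator weight: ordinary gradient
    grad_f = dloss_dlogit * v       # gradient arriving at the encoder output
    grad_w = -lam * grad_f * x      # reversed before reaching the encoder weight
    return loss, grad_v, grad_w


if __name__ == "__main__":
    loss, grad_v, grad_w = grl_gradients(w=1.0, v=1.0, x=1.0, y=1, lam=1.0)
    print(loss, grad_v, grad_w)
    print(binary_entropy(0.5), binary_entropy(0.99))
```

With gradient descent on both weights, the discriminator weight `v` moves to lower the domain loss while the reversed gradient pushes the encoder weight `w` to raise it, which is the mechanism that encourages features the discriminator cannot separate by domain.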

Abstract

Recent LiDAR-based 3D Object Detection (3DOD) methods show promising results, but they often do not generalize well to target domains outside the source (or training) data distribution. To reduce such domain gaps and thus make 3DOD models more generalizable, we introduce a novel unsupervised domain adaptation (UDA) method, called CMDA, which (i) leverages visual semantic cues from an image modality (i.e., camera images) as an effective semantic bridge to close the domain gap in the cross-modal Bird's Eye View (BEV) representations. Further, (ii) we introduce a self-training-based learning strategy, wherein a model is adversarially trained to generate domain-invariant features, which disrupt the discrimination of whether a feature instance comes from a source or an unseen target domain. Overall, our CMDA framework guides the 3DOD model to generate highly informative and domain-adaptive features for novel data distributions. In our extensive experiments on large-scale benchmarks such as nuScenes, Waymo, and KITTI, these two components provide significant performance gains on UDA tasks, achieving state-of-the-art performance.
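
The cross-modal idea in (i), pulling the LiDAR BEV features toward spatially paired image BEV features, can be sketched as a simple alignment loss. This is only an illustration under an assumption: it uses a mean squared error between the two grids, which this page does not confirm as the form of the paper's cross-modal objective. The function name `bev_alignment_loss` and the nested-list `[H][W][C]` layout are hypothetical.

```python
def bev_alignment_loss(lidar_bev, image_bev):
    """Mean squared error between two BEV feature grids.

    Both inputs are nested lists shaped [H][W][C] and are assumed to be
    spatially paired, i.e. cell (i, j) covers the same ground area in both.
    The MSE distance is an assumption made for this sketch.
    """
    total = 0.0
    count = 0
    for row_l, row_i in zip(lidar_bev, image_bev):
        for cell_l, cell_i in zip(row_l, row_i):
            for a, b in zip(cell_l, cell_i):
                total += (a - b) ** 2
                count += 1
    if count == 0:
        raise ValueError("BEV grids must contain at least one feature value")
    return total / count


if __name__ == "__main__":
    # One BEV cell with two channels per modality.
    lidar = [[[0.0, 0.0]]]
    image = [[[1.0, 3.0]]]
    print(bev_alignment_loss(lidar, image))
```

In a pre-training setup of this kind, the image branch would typically act as the source of semantic cues and only the LiDAR encoder would be updated to reduce the loss; whether the paper freezes the image branch is not stated here.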

Paper Structure (21 sections, 7 equations, 6 figures, 3 tables, 1 algorithm)

Figures (6)

  • Figure 1: An overview of our architecture. Our framework consists of two main steps. (a) Cross-Modal LiDAR Encoder Pre-Training: aligning spatially paired image-based and LiDAR-based BEV representations for cross-modal BEV feature learning. This allows the LiDAR encoder to learn modality-specific visual semantic information from the image features. (b) Cross-Domain LiDAR-Only Self-Training: learning domain-invariant features through adversarial regularization of the LiDAR encoder, ultimately reducing the representation gap between source and target domains.
  • Figure 2: An overview of our Images-to-BEV View Transform module. We first transform multi-view images into voxel-wise representations $F_{I}^{vox}$ by simultaneously leveraging $F_I$ and $D_{depth}$, yielding a BEV representation $F_{I}^{bev}$.
  • Figure 3: An overview of our cross-domain self-training step. Given a mixed point scene (source-domain points replace target-domain points in a randomly chosen region), our domain discriminator is adversarially trained to classify whether an object comes from the source or the target domain.
  • Figure 4: t-SNE (van der Maaten and Hinton, 2008) visualizations of the LiDAR-based BEV feature distributions of the source (S, red) and target (T, blue) domains.
  • Figure 5: Statistical analyses of detection results: (left) perception capacity for the number of points per object and (right) accuracy comparison across the range.
  • ...and 1 more figure
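
The mixed point scene described in the Figure 3 caption (source-domain points replacing target-domain points in a randomly chosen region) can be sketched as follows. This is a minimal illustration under stated assumptions: the region is taken to be an axis-aligned square in the BEV plane, points are plain `(x, y, z)` tuples, and the per-point domain labels (0 = source, 1 = target) are a convention chosen here. The helper names `in_region`, `sample_region`, and `mix_point_scenes` are hypothetical and do not come from the paper.

```python
import random


def in_region(point, region):
    """True if the point's (x, y) lies inside the BEV box (x_min, x_max, y_min, y_max)."""
    x, y = point[0], point[1]
    x_min, x_max, y_min, y_max = region
    return x_min <= x < x_max and y_min <= y < y_max


def sample_region(x_range, y_range, size, rng):
    """Sample a square BEV region of side `size` that fits inside the scene bounds."""
    x_min = rng.uniform(x_range[0], x_range[1] - size)
    y_min = rng.uniform(y_range[0], y_range[1] - size)
    return (x_min, x_min + size, y_min, y_min + size)


def mix_point_scenes(source_pts, target_pts, region):
    """Replace target points inside `region` with the source points inside it.

    Returns (mixed_points, domain_labels), with label 0 for source points
    and 1 for target points (an assumed convention).
    """
    kept_target = [p for p in target_pts if not in_region(p, region)]
    pasted_source = [p for p in source_pts if in_region(p, region)]
    mixed = kept_target + pasted_source
    labels = [1] * len(kept_target) + [0] * len(pasted_source)
    return mixed, labels


if __name__ == "__main__":
    rng = random.Random(0)
    source = [(1.0, 1.0, 0.0), (10.0, 10.0, 0.0)]
    target = [(2.0, 2.0, 0.0), (20.0, 20.0, 0.0)]
    region = sample_region((0.0, 50.0), (0.0, 50.0), size=5.0, rng=rng)
    print(region)
    print(mix_point_scenes(source, target, (0.0, 5.0, 0.0, 5.0)))
```

The domain labels returned alongside the mixed scene are what a discriminator like the one in Figure 3 would be trained against; the caption describes classification at the object level, so a real pipeline would presumably aggregate labels per object rather than per point.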