Multimodal 3D Object Detection on Unseen Domains

Deepti Hegde, Suhas Lohit, Kuan-Chuan Peng, Michael J. Jones, Vishal M. Patel

TL;DR

The paper addresses domain generalization for 3D object detection in autonomous driving, where the target domain is unseen during training. It introduces CLIX3D, a framework that combines MSFusion LiDAR-image fusion with region-level supervised contrastive learning to align object features across multiple source domains. Across Lyft, KITTI, Waymo Open, and nuScenes, multimodal, multi-source training improves generalization to unseen domains, with CLIX3D outperforming single-source domain generalization (DG) methods and prior baselines. The training objective sums localization, classification, RPN, and contrastive terms, $L = L_{loc} + L_{cls} + L_{rpn} + L_{con}$, where the contrastive term $L_{con}$ encourages domain-invariant object features.

Abstract

LiDAR datasets for autonomous driving exhibit biases in properties such as point cloud density, range, and object dimensions. As a result, object detection networks trained and evaluated in different environments often experience performance degradation. Domain adaptation approaches assume access to unannotated samples from the test distribution to address this problem. However, in the real world, the exact conditions of deployment and access to samples representative of the test dataset may be unavailable while training. We argue that the more realistic and challenging formulation is to require robustness in performance to unseen target domains. We propose to address this problem in a two-pronged manner. First, we leverage paired LiDAR-image data present in most autonomous driving datasets to perform multimodal object detection. We suggest that working with multimodal features by leveraging both images and LiDAR point clouds for scene understanding tasks results in object detectors more robust to unseen domain shifts. Second, we train a 3D object detector to learn multimodal object features across different distributions and promote feature invariance across these source domains to improve generalizability to unseen target domains. To this end, we propose CLIX$^\text{3D}$, a multimodal fusion and supervised contrastive learning framework for 3D object detection that performs alignment of object features from same-class samples of different domains while pushing the features from different classes apart. We show that CLIX$^\text{3D}$ yields state-of-the-art domain generalization performance under multiple dataset shifts.
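
The alignment idea in the abstract (pull same-class object features together across domains, push different classes apart) can be illustrated with a small NumPy sketch. It uses the generic supervised contrastive loss, which is an assumption here: the abstract does not give the exact form of the contrastive term. The function name, the `temperature` value, and the optional `domains` argument (which restricts positives to cross-domain pairs) are hypothetical choices made for illustration only.

```python
import numpy as np


def supervised_contrastive_loss(features, labels, domains=None, temperature=0.1):
    """Generic supervised contrastive loss over region-level features.

    Illustrative sketch only, not the paper's implementation.
    features: (N, D) array of pooled region (RoI) features from all source domains.
    labels:   (N,) integer class ids, e.g. 0=Car, 1=Pedestrian, 2=Cyclist.
    domains:  optional (N,) integer domain ids; if given, only same-class pairs
              from different domains count as positives (an assumption).
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    n = features.shape[0]
    if n < 2:
        return 0.0

    # L2-normalize so that dot products are cosine similarities.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    z = features / np.maximum(norms, 1e-12)

    not_self = ~np.eye(n, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self
    if domains is not None:
        domains = np.asarray(domains)
        positives &= domains[:, None] != domains[None, :]

    # Log-softmax over all other samples; each sample is excluded from its own row.
    logits = np.where(not_self, z @ z.T / temperature, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Average log-probability of the positives, for anchors that have any.
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())


# Toy usage: four region features, two classes, two source domains.
toy_features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
toy_labels = np.array([0, 0, 1, 1])
toy_domains = np.array([0, 1, 0, 1])
print(supervised_contrastive_loss(toy_features, toy_labels, toy_domains))
```

The loss is small when features of the same class already cluster together regardless of domain, and grows when same-class features from different domains lie far apart, which is the behavior the alignment objective is meant to penalize.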

Paper Structure (15 sections, 2 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: (Left) Overview of the contrastive learning framework of CLIX3D. We use multi-source training and supervised contrastive learning between region-level features to improve the robustness of 3D LiDAR and LiDAR-image based object detection networks. (Middle-left) Comparison of 3D detection precision performance of CLIX3D against LiDAR-only detection and SOTA fusion methods. (Far-right) Comparison of CLIX3D's robustness to unseen distributions with SOTA domain generalization work.
  • Figure 2: LiDAR and image scene examples from the KITTI, Waymo, and nuScenes datasets in different environmental conditions, which are listed at the beginning of each row. Images are particularly sensitive to illumination conditions, while LiDAR scenes differ in density between datasets. The first column shows the image scene, the second column is a bird's-eye-view (BEV) of the LiDAR scene, and the last column shows the LiDAR scene projected onto the front image.
  • Figure 3: Description of the proposed CLIX3D for generalizing 3D object detectors to unseen target domains. Samples from multiple diverse source domains are used to train the object detectors. Following multi-stage deep feature fusion in the feature extraction backbone, supervised contrastive learning is applied to the ROI features obtained after the pooling step, which encourages domain invariance. As illustrated on the right side, region features that belong to the same class but come from different domains are encouraged to be closer together in feature space, while those that belong to different classes are pushed apart.
  • Figure 4: A qualitative comparison of the detection results of Part-$A^2$ trained for the domain shift scenario $\text{Waymo, nuScenes}\rightarrow \text{KITTI}$. Ground truth bounding boxes for the "Car" category are in green, in magenta for the "Pedestrian" category, and in cyan for the "Cyclist" category. Predictions are in red. (Best viewed zoomed in and in color).