Improving Generalization Ability for 3D Object Detection by Learning Sparsity-invariant Features

Hsin-Cheng Lu, Chung-Yi Lin, Winston H. Hsu

TL;DR

The paper tackles the generalization gap in LiDAR-based 3D object detection when deployed to unseen domains with different sensor configurations and scene distributions. It introduces sparsity-invariant feature learning by downsampling source point clouds to varied beam densities using detector-driven confidence, implemented within a teacher-student BEV framework that applies Feature Content Alignment ($L_{\text{FCA}}$) and Graph-based Embedding Relationship Alignment (GERA) to learn domain-agnostic representations. The approach optimizes $L_{\text{overall}} = L_{\text{det}} + b_1 L_{\text{FCA}} + b_2 L_{\text{GERA}}$, and leverages a Gromov-Wasserstein-based loss to preserve high-level proposal relationships across densities. Experiments on Waymo, KITTI, and nuScenes show superior generalization to unseen domains and compatibility with domain adaptation methods, sometimes matching or exceeding target-domain baselines, thereby reducing reliance on multi-domain labeled data. This work advances practical robustness for autonomous driving by enabling a single-domain-trained detector to operate effectively across diverse LiDAR configurations and environments.
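
To make the weighted objective and the relationship-preserving idea concrete, here is a minimal NumPy sketch. It is not the paper's implementation. It assumes that teacher and student proposals are already matched one to one, so the Gromov-Wasserstein objective with a squared loss is evaluated under a fixed identity coupling instead of an optimized one. The function names, the squared Euclidean distance, and the default weights `b1` and `b2` are all illustrative assumptions.

```python
import numpy as np


def pairwise_sq_dists(x):
    """Squared Euclidean distances between all rows of x (shape: n x d)."""
    diff = x[:, None, :] - x[None, :, :]
    return (diff ** 2).sum(axis=-1)


def relationship_alignment_loss(teacher_emb, student_emb):
    """Simplified GW-style discrepancy (assumption: identity coupling T = I / n).

    With that fixed coupling, the squared-loss GW objective reduces to the
    mean squared difference between the two intra-set distance matrices.
    """
    c_teacher = pairwise_sq_dists(teacher_emb)
    c_student = pairwise_sq_dists(student_emb)
    return float(((c_teacher - c_student) ** 2).mean())


def overall_loss(l_det, l_fca, l_gera, b1=1.0, b2=1.0):
    """Weighted sum of the three terms; b1 and b2 are placeholder weights."""
    return l_det + b1 * l_fca + b2 * l_gera


if __name__ == "__main__":
    teacher = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
    student = teacher + 5.0  # a pure translation keeps all pairwise distances
    l_gera = relationship_alignment_loss(teacher, student)
    print(l_gera, overall_loss(1.0, 2.0, l_gera, b1=0.5, b2=2.0))
```

Because the loss compares distance matrices and not raw embeddings, it only penalizes changes in how proposals relate to each other, which is the property the TL;DR attributes to the Gromov-Wasserstein term.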

Abstract

In autonomous driving, 3D object detection is essential for accurately identifying and tracking objects. Despite continuous progress on this task, most existing detectors share a significant drawback: they suffer substantial performance degradation when detecting objects in unseen domains. In this paper, we propose a method to improve the generalization ability of 3D object detection when training on a single domain. We focus on generalizing from a single source domain to target domains with distinct sensor configurations and scene distributions. To learn sparsity-invariant features from a single source domain, we selectively subsample the source data to a specific beam type, using confidence scores produced by the current detector to identify the density that matters most to the detector. We then employ a teacher-student framework to align the Bird's Eye View (BEV) features across different point cloud densities. We also use feature content alignment (FCA) and graph-based embedding relationship alignment (GERA) to encourage the detector to be domain-agnostic. Extensive experiments demonstrate that our method generalizes better than other baselines. Furthermore, our approach even outperforms certain domain adaptation methods that have access to target-domain data.
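
The beam subsampling step described in the abstract can be sketched as follows. This is a hedged illustration and not the authors' code. It assumes the point cloud is an `(N, 3)` array of `x, y, z` coordinates in the sensor frame and that beam indices can be approximated by uniformly binning elevation angles (real datasets may instead ship a per-point ring index). The names `assign_beam_ids`, `downsample_beams`, and `keep_every` are hypothetical.

```python
import numpy as np


def assign_beam_ids(points, num_beams=64):
    """Approximate beam index per point by uniformly binning elevation angle.

    Assumption: beams are evenly spaced in elevation, which real sensors
    only roughly satisfy.
    """
    xy_range = np.linalg.norm(points[:, :2], axis=1)
    elevation = np.arctan2(points[:, 2], xy_range)
    lo, hi = elevation.min(), elevation.max()
    if hi == lo:
        # Degenerate case: all points share one elevation angle.
        return np.zeros(len(points), dtype=np.int64)
    bins = ((elevation - lo) / (hi - lo) * num_beams).astype(np.int64)
    return np.clip(bins, 0, num_beams - 1)


def downsample_beams(points, num_beams=64, keep_every=2):
    """Keep only points whose beam index is a multiple of keep_every."""
    beam_ids = assign_beam_ids(points, num_beams)
    return points[beam_ids % keep_every == 0]


if __name__ == "__main__":
    # Synthetic cloud: one point per beam at 64 evenly spaced elevations.
    thetas = np.linspace(-0.3, 0.3, 64)
    cloud = np.stack([np.ones(64), np.zeros(64), np.tan(thetas)], axis=1)
    print(downsample_beams(cloud, 64, keep_every=2).shape)  # 64 -> 32 beams
    print(downsample_beams(cloud, 64, keep_every=4).shape)  # 64 -> 16 beams
```

Calling `downsample_beams` with several `keep_every` values yields a set of candidate clouds at different densities, from which one can then be selected.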

Paper Structure

This paper contains 19 sections, 6 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Traditional 3D detectors, trained directly on the source domain, often experience a significant performance drop when the point clouds become sparser. In contrast, our method empowers the 3D detector to learn sparsity-invariant features through training with our proposed augmentation and feature alignment techniques. Note that in the detection results, blue boxes represent the ground truth annotations, while the green boxes indicate the predicted boxes.
  • Figure 2: Overview of our proposed method. Initially, we downsample the original point clouds into various densities and then select one based on detector confidence. To learn domain-agnostic features, feature content alignment (FCA) is applied to the BEV features of each shared Region of Interest (ROI) to enforce low-level content consistency. Subsequently, the encoded features are constrained by graph-based embedding relationship alignment (GERA) to maintain high-level relationship consistency. The blue and red flows illustrate the processing pipelines for the teacher and student models, respectively.
  • Figure 3: Illustration of our confidence-based selection strategy. $P^{s}$ and $P^{a'}$ denote the original point clouds and the final augmented point clouds, respectively, while $\{P^{a}_{i}\}_{i=1}^{N}$ indicates $N$ augmented point clouds with different beam types.
  • Figure 4: Performance on the KITTI dataset [geiger2012we] across various point cloud densities. "Source Only" indicates direct evaluation of the model trained on the original (64-beam) dataset on lower-beam validation sets. "Ours" denotes a detector trained using the proposed method.
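
The confidence-based selection in the Figure 3 caption, which picks one final augmented cloud out of $N$ candidates with different beam types, can be sketched as below. The caption does not say whether the lowest or the highest detector confidence is preferred, so the `prefer_low` switch is an assumption, and `confidence_fn` is a hypothetical stand-in for running the current detector on a candidate and summarizing its confidence scores.

```python
import numpy as np


def select_by_confidence(candidates, confidence_fn, prefer_low=True):
    """Pick one augmented point cloud from the candidates.

    candidates: sequence of point clouds, one per beam type.
    confidence_fn: hypothetical callable returning a scalar confidence
        summary of the current detector on one candidate.
    prefer_low: assumption, not taken from the source; if True the
        candidate with the lowest confidence is chosen.
    """
    scores = np.array([confidence_fn(c) for c in candidates], dtype=float)
    idx = int(scores.argmin() if prefer_low else scores.argmax())
    return idx, candidates[idx]


if __name__ == "__main__":
    # Toy stand-in: pretend confidence grows with the number of points.
    clouds = [np.zeros((n, 3)) for n in (640, 320, 160)]
    toy_confidence = lambda cloud: len(cloud) / 1000.0
    idx, chosen = select_by_confidence(clouds, toy_confidence)
    print(idx, chosen.shape)
```

Keeping the scoring function as a parameter leaves the choice of detector and of confidence summary (for example, a mean over predicted boxes) open.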