Finetuning Pre-trained Model with Limited Data for LiDAR-based 3D Object Detection by Bridging Domain Gaps

Jiyun Jang; Mincheol Chang; Jongwon Park; Jinkyu Kim

TL;DR

This work proposes Domain Adaptive Distill-Tuning (DADT), a method that adapts a pre-trained model with limited target data while retaining its representation power and preventing overfitting. The authors report that DADT finetunes a pre-trained model effectively and achieves significant gains in accuracy.

Abstract

LiDAR-based 3D object detectors are widely used in applications such as autonomous vehicles and mobile robots. However, they often fail to adapt well to target domains with different sensor configurations (e.g., sensor types, spatial resolution, or FOVs) and location shifts. Collecting and annotating datasets in a new setup is commonly required to reduce such gaps, but it is often expensive and time-consuming. Recent studies suggest that pre-trained backbones can be learned in a self-supervised manner from large-scale unlabeled LiDAR frames. Despite their expressive representations, however, these backbones still struggle to generalize without substantial amounts of data from the target domain. Thus, we propose a novel method, called Domain Adaptive Distill-Tuning (DADT), to adapt a pre-trained model with limited target data (approximately 100 LiDAR frames), retaining its representation power and preventing it from overfitting. Specifically, we use regularizers to align object-level and context-level representations between the pre-trained and finetuned models in a teacher-student architecture. Our experiments on driving benchmarks, i.e., the Waymo Open dataset and KITTI, confirm that our method effectively finetunes a pre-trained model, achieving significant gains in accuracy.
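To make the idea of aligning object-level and context-level representations more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes BEV feature maps of shape (C, H, W) from a frozen teacher and a trainable student, plus a boolean mask of grid cells covered by objects. The function names, the mean-squared-error form of both terms, and the use of a mean object feature as a prototype are all illustrative assumptions; the paper's actual loss definitions may differ.

```python
import numpy as np


def object_similarity_loss(teacher_bev, student_bev, object_mask):
    """Hypothetical object-level term: mean squared difference between
    teacher and student features, restricted to object cells.

    teacher_bev, student_bev: arrays of shape (C, H, W).
    object_mask: boolean array of shape (H, W), True on object cells.
    """
    if not object_mask.any():
        return 0.0
    # Boolean indexing over the last two axes yields shape (C, N).
    diff = teacher_bev[:, object_mask] - student_bev[:, object_mask]
    return float(np.mean(diff ** 2))


def context_similarity_map(bev, object_mask, eps=1e-8):
    """Cosine similarity between the mean object feature (an assumed
    prototype) and every grid cell. Returns an (H, W) map."""
    prototype = bev[:, object_mask].mean(axis=1)          # (C,)
    dots = np.tensordot(prototype, bev, axes=(0, 0))      # (H, W)
    norms = np.linalg.norm(prototype) * np.linalg.norm(bev, axis=0)
    return dots / (norms + eps)


def context_similarity_loss(teacher_bev, student_bev, object_mask):
    """Hypothetical context-level term: mean squared difference between
    the teacher's and the student's object-to-grid similarity maps."""
    if not object_mask.any():
        return 0.0
    t_map = context_similarity_map(teacher_bev, object_mask)
    s_map = context_similarity_map(student_bev, object_mask)
    return float(np.mean((t_map - s_map) ** 2))
```

In a training loop, terms like these would typically be added to the detection loss with small weights, so that the student stays close to the teacher's representation while still fitting the limited target data.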

Paper Structure (13 sections, 7 equations, 7 figures, 5 tables)

INTRODUCTION
RELATED WORK
Self-Supervised Pre-Training in 3D Object Detection
Unsupervised Domain Adaptation in Point Clouds
General Model Finetuning
METHODOLOGY
Problem Statement
Teacher-Student Architecture for Reducing a Density-driven Representational Gap
BEV-based Similarity Losses
EXPERIMENTS
Effect of Finetuning Pre-trained Model with Limited Data
Performance in Continual Training Scenarios
CONCLUSION

Figures (7)

Figure 1: (a) UMAP Visualizations. We compare 2D BEV features between (i) the oracle model (a model pre-trained by AD-PT and finetuned with the whole KITTI training data) and (ii) similarly finetuned models but with smaller datasets. (b) An Overview of Architectures. Conventional finetuning approaches (left) and our proposed Domain Adaptive Distill-Tuning (DADT) approach (right).
Figure 2: Overview of our proposed Domain Adaptive Distill-Tuning (DADT) framework. DADT uses a teacher-student architecture to reduce a density-driven representational gap. The downstream dataset $D_d$ is downsampled using Pseudo Low Beam Generation to create $D^{pseudo}_{d}$, which is similar to the pre-training dataset. Two losses supervise and regularize the student's BEV feature distribution so that it matches the teacher's general feature distribution. The BEV object similarity loss encourages the same objects to have similar features in the teacher and the student. The BEV context similarity loss computes the grid-wise similarity of the objects, which highlights their semantic features in the current scene, and encourages this similarity to match between the teacher and the student.
Figure 3: UMAP Visualizations. We compare representational differences depending on the beam density (i.e., 16, 32, 64-beams). We use KITTI data, downsampling it into a lower-density LiDAR point cloud.
Figure 4: Examples of Detected 3D Objects from (a) baseline model and (b) ours. Red boxes and green boxes denote ground-truth and predicted bounding boxes, respectively.
Figure 5: UMAP Visualizations of features from the Oracle (a pre-trained backbone finetuned with 100% KITTI training data), our baseline, and our method. The KITTI validation set is used for visualization.
...and 2 more figures
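
The beam downsampling mentioned in the captions of Figures 2 and 3 can be illustrated with a rough NumPy sketch. This is not the paper's Pseudo Low Beam Generation procedure: it assumes that beams can be approximated by binning points uniformly by elevation angle and that a lower-density scan is obtained by keeping every k-th bin. The function name `pseudo_low_beam` and its parameters are hypothetical, and real sensors often have non-uniform beam spacing.

```python
import numpy as np


def pseudo_low_beam(points, num_beams=64, keep_every=2):
    """Approximate a lower-beam LiDAR scan from a denser one.

    points: array of shape (N, D) with x, y, z in the first three columns.
    num_beams: assumed number of beams in the source sensor.
    keep_every: keep one beam out of every `keep_every` (2 -> about half).

    Assumption: beams are recovered by uniform binning of the elevation
    angle, which only approximates a real sensor's beam layout.
    """
    xyz = points[:, :3]
    horizontal_range = np.linalg.norm(xyz[:, :2], axis=1)
    elevation = np.arctan2(xyz[:, 2], horizontal_range)

    lowest, highest = elevation.min(), elevation.max()
    span = max(highest - lowest, 1e-9)  # avoid division by zero

    # Map each elevation to a beam index in [0, num_beams - 1].
    beam_index = ((elevation - lowest) / span * num_beams).astype(int)
    beam_index = np.minimum(beam_index, num_beams - 1)

    keep = beam_index % keep_every == 0
    return points[keep]
```

With `keep_every=2` a 64-beam scan is reduced to roughly 32 beams, and with `keep_every=4` to roughly 16, which mirrors the beam densities compared in Figure 3.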
