
WLST: Weak Labels Guided Self-training for Weakly-supervised Domain Adaptation on 3D Object Detection

Tsung-Lin Tsou, Tsung-Han Wu, Winston H. Hsu

TL;DR

This work tackles practical weakly-supervised domain adaptation for 3D object detection by formulating the problem as adapting from a labeled source domain $D_s$ to a weakly-labeled target domain $D_t$ and proposing WLST, a three-stage framework that couples a 3D detector with an autolabeler to generate robust pseudo labels from 2D weak labels. A key novelty is the consistency fusion strategy, which jointly exploits geometric consistency and cross-modality cues to select high-quality pseudo labels from both modalities, enabling effective self-training under domain shifts. The approach is demonstrated on three benchmarks (Waymo, nuScenes, KITTI) and consistently outperforms prior unsupervised and weakly-supervised DA methods, significantly closing the gap to the fully supervised Oracle while remaining detector- and autolabeler-agnostic. The work offers a cost-effective path toward deploying robust 3D detectors in real-world, cross-domain autonomous driving scenarios, leveraging weak target annotations and cross-modal supervision.

Abstract

In the field of domain adaptation (DA) on 3D object detection, most work is dedicated to unsupervised domain adaptation (UDA). Yet, without any target annotations, the performance gap between UDA approaches and the fully-supervised approach remains noticeable, which makes UDA impractical for real-world applications. On the other hand, weakly-supervised domain adaptation (WDA) is an underexplored yet practical task that requires only a small labeling effort on the target domain. To improve DA performance in a cost-effective way, we propose a general weak labels guided self-training framework, WLST, designed for WDA on 3D object detection. By incorporating an autolabeler, which can generate 3D pseudo labels from 2D bounding boxes, into the existing self-training pipeline, our method is able to generate more robust and consistent pseudo labels that benefit the training process on the target domain. Extensive experiments demonstrate the effectiveness, robustness, and detector-agnosticism of our WLST framework. Notably, it outperforms previous state-of-the-art methods on all evaluation tasks.

Paper Structure (24 sections, 2 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Our WLST framework is composed of three stages. (a) Pre-train the 3D detector and autolabeler on the source data (see the "Model Pre-training" section). (b) Generate high-quality pseudo labels on the target data with our consistency fusion strategy (see the "Pseudo-label Generation" section). (c) Re-train the 3D detector and autolabeler on the pseudo-labeled target data (see the "Model Re-training" section).
  • Figure 2: Left: Visualization of false positive (FP), false negative (FN), and true positive (TP) boxes of the pseudo labels. Right: By projective geometry, frustums can be generated by using the 2D bounding boxes as the projection source. These frustums define the 3D search space for pseudo labels, which implies that an object should be located in the frustum corresponding to its 2D bounding box. In other words, when we re-project the pseudo labels onto the 2D image plane: (a) A TP box tends to have a higher IoU with its corresponding 2D bounding box. (b) An FP box has no corresponding 2D bounding box and is less likely to have a decent IoU with any 2D bounding box. (c) We can also infer that an object should exist in the frustum corresponding to an FN box.
  • Figure 3: Visualization of pseudo labels $[\hat{L}_{det}^i]_1$ and $[\hat{L}_{aut}^i]_1$ generated by the 3D detector and the autolabeler, respectively. Top: $[\hat{L}_{aut}^i]_1$ has higher precision. (a) It is less likely to predict extra FP boxes. (b) It predicts the heights of objects more precisely. Bottom: $[\hat{L}_{det}^i]_1$ has higher recall. (c, d) It has a better understanding of the correlation between objects, e.g., a line of vehicles.
  • Figure 4: Our proposed autolabeler designed for DA. The model takes the frustum points in camera coordinates as input and outputs a 3D pseudo label. ($M_{seg}$ denotes the foreground segmentation network, and $M_{reg}$ denotes the box regression network.)
  • Figure 5: Qualitative Analysis on Pseudo Labels over Time. Comparison between our WLST and the state-of-the-art UDA method ST3D on the Waymo $\rightarrow$ KITTI task. We utilize Recall with IoU $>$ 0.7 and Precision with IoU $>$ 0.7 as our metrics to assess the quality of pseudo labels.
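
The geometric cue described in the Figure 2 caption (re-project a 3D pseudo label onto the image plane and compare it with the 2D weak labels) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a pinhole camera with intrinsics `fx, fy, u0, v0`, a KITTI-like camera frame (x right, y down, z forward), and a box given by its center, `(length, height, width)` size, and yaw about the y axis. All function names are hypothetical, and the sketch does not set an IoU threshold because the listing above does not state one.

```python
import math


def box3d_corners(center, size, yaw):
    """Return the 8 corners of a 3D box in camera coordinates.

    Assumed convention (not taken from the paper): x right, y down,
    z forward; `center` is the box centre, `size` is
    (length, height, width), and `yaw` rotates about the y axis.
    """
    cx, cy, cz = center
    length, height, width = size
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (-length / 2, length / 2):
        for dy in (-height / 2, height / 2):
            for dz in (-width / 2, width / 2):
                # Rotate the offset about the y axis, then translate.
                x = cx + c * dx + s * dz
                z = cz - s * dx + c * dz
                corners.append((x, cy + dy, z))
    return corners


def project_to_box2d(corners, fx, fy, u0, v0):
    """Project corners with a pinhole model; return (u_min, v_min, u_max, v_max)."""
    us, vs = [], []
    for x, y, z in corners:
        if z <= 0:
            # A corner behind the camera cannot be projected; this sketch
            # simply gives up on such boxes instead of clipping them.
            return None
        us.append(fx * x / z + u0)
        vs.append(fy * y / z + v0)
    return (min(us), min(vs), max(us), max(vs))


def iou2d(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def best_match_iou(pseudo_box, weak_boxes, intrinsics):
    """Highest IoU between a re-projected 3D pseudo label and any 2D weak label.

    `pseudo_box` is (center, size, yaw); `intrinsics` is (fx, fy, u0, v0).
    Returns 0.0 when the box cannot be projected or no weak label exists.
    """
    projected = project_to_box2d(box3d_corners(*pseudo_box), *intrinsics)
    if projected is None:
        return 0.0
    return max((iou2d(projected, wb) for wb in weak_boxes), default=0.0)
```

Under these assumptions, a pseudo label whose `best_match_iou` is high behaves like the TP case (a) in Figure 2, while a value near zero matches the FP case (b). A 2D weak label that no pseudo label matches corresponds to the FN case (c).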