SC3D: Label-Efficient Outdoor 3D Object Detection via Single Click Annotation

Qiming Xia, Hongwei Lin, Wei Ye, Hai Wu, Yadan Luo, Cheng Wang, Chenglu Wen

TL;DR

Experimental results on the widely used nuScenes and KITTI datasets demonstrate that SC3D, using only coarse clicks that require just 0.2% of the annotation cost, achieves state-of-the-art performance among weakly supervised 3D detection methods.

Abstract

LiDAR-based outdoor 3D object detection has received widespread attention. However, training 3D detectors on LiDAR point clouds typically relies on expensive bounding box annotations. This paper presents SC3D, an innovative label-efficient method requiring only a single coarse click on the bird's eye view of the 3D point cloud for each frame. A key challenge here is that such simple click annotations lack a complete geometric description of the target objects. To address this issue, SC3D adopts a progressive pipeline. First, we design a mixed pseudo-label generation module that expands the limited click annotations into a mixture of bounding box and semantic mask supervision. Next, we propose a mixed-supervised teacher model, enabling the detector to learn from the mixed supervision. Finally, we introduce a mixed-supervised student network that leverages the teacher model's generalization ability to learn unclicked instances. Experimental results on the widely used nuScenes and KITTI datasets demonstrate that SC3D, using only coarse clicks that require just 0.2% of the annotation cost, achieves state-of-the-art performance among weakly supervised 3D detection methods. The code will be made publicly available.
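To make the gap between a click and a full box annotation concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common 7-parameter box convention (center, size, yaw) and a hypothetical helper, `points_near_click`, that gathers LiDAR points within a fixed BEV radius of a click; the names `Click`, `Box3D`, and the `radius` parameter are illustrative assumptions that do not appear in the paper.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Click:
    """A coarse click: only a 2D position on the bird's eye view plane."""
    x: float
    y: float


@dataclass
class Box3D:
    """A full box annotation under the usual 7-parameter convention (assumed here)."""
    x: float
    y: float
    z: float
    length: float
    width: float
    height: float
    yaw: float


def points_near_click(points: np.ndarray, click: Click, radius: float = 1.0) -> np.ndarray:
    """Return the rows of an (N, 3) point array whose BEV distance to the click is <= radius.

    Hypothetical helper: the radius is an illustrative choice, not a value from the paper.
    """
    bev_distance = np.hypot(points[:, 0] - click.x, points[:, 1] - click.y)
    return points[bev_distance <= radius]
```

A click fixes two numbers, whereas a box under this convention needs seven; the height, extent, and heading have to be recovered from the points themselves, which is the role the pseudo-label generation stage plays in the pipeline described above.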

Paper Structure (34 sections, 5 equations, 5 figures, 7 tables)

This paper contains 34 sections, 5 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: (a) Unlike traditional, costly box annotations, coarse-click annotation requires only a quick click on a single object in the 2D BEV plane, yet offers limited supervision; (b) comparison with existing label-efficient methods on the KITTI dataset: we reduce the annotation cost to 0.2%, improving on the labeling efficiency of previous schemes by more than 10 times while maintaining detector performance.
  • Figure 2: Overview of the proposed SC3D. (a) First, a novel motion state classification strategy is introduced; box-level pseudo-labels $\textcolor{orange}{\mathbf{L}_{b}}$ and mask-level pseudo-labels $\textcolor{rgb(139,0,0)}{\mathbf{L}_{m}}$ are then generated by the Click2Box and Click2Mask modules, respectively. (b) The mixed pseudo-labels generated in stage (a) are used to train the mixed-supervised teacher detector, after which mask-level supervision is updated to box-level supervision based on high-confidence predictions. (c) The generalization ability of the teacher network is used to produce mixed pseudo-labels for unlabeled instances, further improving the performance of the mixed-supervised student network.
  • Figure 3: Difference in point distribution between dynamic and static objects across consecutive frames: dynamic objects show a rapid change in local point density over time, whereas static objects have a stable local point density.
  • Figure 4: Training strategies for the mixed-supervised student network.
  • Figure 5: Prediction results and pseudo-label quality across iterative rounds. (a) and (b) show the 3D and BEV results for cars at IoU thresholds of 0.5 and 0.7, respectively; (c) compares the pseudo-labels with the ground truth in each iterative round.
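The density cue described in the Figure 3 caption can be sketched roughly as follows. This is an illustrative approximation rather than the paper's motion state classification strategy: it assumes the consecutive frames are already aligned in a common world coordinate system, and the function names, the `radius`, and the `rel_change_thresh` threshold are all hypothetical choices.

```python
import numpy as np


def local_point_count(points: np.ndarray, click_xy, radius: float) -> int:
    """Count points of an (N, 3) array whose BEV (x, y) lies within `radius` of the click."""
    distance = np.linalg.norm(points[:, :2] - np.asarray(click_xy, dtype=float), axis=1)
    return int(np.count_nonzero(distance <= radius))


def classify_motion_state(frames, click_xy, radius: float = 1.0,
                          rel_change_thresh: float = 0.5):
    """Label a clicked location as 'static' or 'dynamic' from local density changes.

    frames: list of (N_t, 3) arrays, assumed ego-motion compensated (an assumption
            of this sketch); frames[0] is the clicked reference frame.
    Returns the label and the per-frame point counts.
    """
    counts = np.array(
        [local_point_count(frame, click_xy, radius) for frame in frames], dtype=float
    )
    reference = counts[0]
    if reference == 0:
        raise ValueError("no points near the click in the reference frame")
    # Largest relative deviation of the local point count from the reference frame.
    rel_change = np.abs(counts - reference).max() / reference
    label = "dynamic" if rel_change > rel_change_thresh else "static"
    return label, counts
```

Under these assumptions, a parked car keeps roughly the same number of points around the clicked location in every frame, while a moving car leaves the neighbourhood and the count drops sharply, which is the contrast the caption describes.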