Are Dense Labels Always Necessary for 3D Object Detection from Point Cloud?

Chenqiang Gao, Chuandong Liu, Jun Shu, Fangcen Liu, Jiang Liu, Luyu Yang, Xinbo Gao, Deyu Meng

TL;DR

This work tackles the high cost of densely annotated 3D bounding boxes for point-cloud object detection by proposing SS3D++, a sparsely-supervised framework that trains detectors using only one annotated object per scene. It iteratively improves detector performance by mining reliable background and confident missing-annotated instances to generate confident fully-annotated scenes for augmentation, while employing a multi-criteria curriculum to manage pseudo-labels. Across KITTI and Waymo, SS3D++ achieves competitive or superior results relative to state-of-the-art weakly/semi-supervised methods. Relative to fully supervised detectors, it reaches on-par or better performance on KITTI at roughly 5× lower annotation cost, and about 90% of their performance on Waymo at roughly 15× lower annotation cost. The method is detector-agnostic and can leverage unlabeled data to further boost performance, offering a practical path toward scalable 3D detection in autonomous driving settings.

Abstract

Current state-of-the-art (SOTA) 3D object detection methods often require a large amount of 3D bounding box annotations for training. However, collecting such large-scale densely-supervised datasets is notoriously costly. To reduce the cumbersome data annotation process, we propose a novel sparsely-annotated framework, in which we annotate just one 3D object per scene. Such a sparse annotation strategy can significantly reduce the heavy annotation burden, but the inexact and incomplete sparse supervision may severely degrade detection performance. To address this issue, we develop the SS3D++ method, which alternately improves 3D detector training and confident fully-annotated scene generation in a unified learning scheme. Using sparse annotations as seeds, we progressively generate confident fully-annotated scenes by means of a missing-annotated instance mining module and a reliable background mining module. Our proposed method produces competitive results when compared with SOTA weakly-supervised methods that use the same or even higher annotation costs. Moreover, compared with SOTA fully-supervised methods, we achieve on-par or even better performance on the KITTI dataset with about 5x lower annotation cost, and 90% of their performance on the Waymo dataset with about 15x lower annotation cost. Additional unlabeled training scenes can further boost the performance.

Paper Structure (30 sections, 14 equations, 20 figures, 16 tables, 5 algorithms)

Figures (20)

  • Figure 1: Illustration of annotation setup for different supervision forms. The green boxes represent 3D or 2D box annotations and the green crosses represent point-level center annotations in the BEV point cloud map. In this paper, we explore a sparse annotation setup in which we just annotate one 3D object in each scene, as shown in Fig. (e).
  • Figure 2: Illustration of 3D moderate AP (Average Precision) vs. annotation cost on the KITTI validation set. To allow comparison with previous methods, the AP is calculated with 40 recall positions at IoU 0.7 for the car class. Compared with SOTA semi-supervised detectors, our SS3D++ yields promising results with a far lower annotation demand. Moreover, even when given fewer annotations than the weakly-supervised methods, our SS3D++ still shows favorable detection performance. The cost calculation is based on the analysis in FGR and ViT-WSS3D.
  • Figure 3: The pipeline of our SS3D++ algorithm. We alternately improve 3D detector training and confident fully-annotated scene generation in a unified learning scheme. To efficiently train the 3D detector, we construct confident fully-annotated scenes based on the missing-annotated instance mining module, the reliable background mining module, and the GT sampling data augmentation strategy. By leveraging the mutual amelioration between high-quality training scene generation and 3D detector training, the obtained 3D detector becomes more robust.
  • Figure 4: Illustration of the proposed confident missing-annotated instance mining module. The scene $P_i$ and the corresponding augmented scene $\mathcal{A}(P_i)$ are input into the 3D detector. We first apply dynamic loss-based filtering to remove the predictions for $P_i$ and $\mathcal{A}(P_i)$ that have a low confidence score (i.e., a large classification loss). Dynamic consistency-guided suppression then filters out the remaining low-quality predictions. Lastly, we carry out the density-aware curriculum filtering process and store the surviving predictions in the instance bank as confident pseudo instances.
  • Figure 5: (a) Illustration of learning hardness (i.e., density) of different 3D objects on the KITTI dataset. (b) We select mined 3D objects in a meaningful order from "easy" to "hard". The red represents the foreground points and the green represents the ground-truth bounding boxes.
  • ...and 15 more figures
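
The three-stage filtering described in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation; it only mirrors the order of the stages under simplifying assumptions. The names `Pred`, `bev_iou`, and `mine_instances` are hypothetical, boxes are axis-aligned rectangles in the bird's-eye view rather than rotated 3D boxes, predictions on the augmented scene are assumed to be already mapped back to the original frame, and the thresholds are plain arguments that a caller could vary across training rounds to imitate the "dynamic" and "curriculum" behaviour.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned BEV box (assumption)


@dataclass
class Pred:
    box: Box
    score: float      # classification confidence in [0, 1]
    num_points: int   # LiDAR points inside the box, used as a density proxy


def bev_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned BEV boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mine_instances(preds: List[Pred], preds_aug: List[Pred],
                   score_thr: float, iou_thr: float, min_points: int) -> List[Pred]:
    """Return predictions that survive all three (simplified) filters."""
    # Stage 1: drop low-confidence predictions from both views.
    kept = [p for p in preds if p.score >= score_thr]
    kept_aug = [q for q in preds_aug if q.score >= score_thr]
    # Stage 2: keep a prediction only if the augmented view agrees with it.
    consistent = [p for p in kept
                  if any(bev_iou(p.box, q.box) >= iou_thr for q in kept_aug)]
    # Stage 3: keep only sufficiently dense ("easy") objects for this round.
    return [p for p in consistent if p.num_points >= min_points]


preds = [
    Pred((0.0, 0.0, 2.0, 4.0), 0.90, 120),      # confident, consistent, dense
    Pred((10.0, 10.0, 12.0, 14.0), 0.95, 5),    # consistent but too sparse
    Pred((20.0, 20.0, 22.0, 24.0), 0.30, 200),  # low confidence
    Pred((30.0, 30.0, 32.0, 34.0), 0.90, 80),   # no match in the augmented view
]
preds_aug = [
    Pred((0.1, 0.0, 2.1, 4.0), 0.85, 118),
    Pred((10.0, 10.0, 12.0, 14.0), 0.90, 5),
    Pred((20.0, 20.0, 22.0, 24.0), 0.90, 200),
]
instance_bank = mine_instances(preds, preds_aug,
                               score_thr=0.7, iou_thr=0.7, min_points=20)
print(len(instance_bank))
```

Lowering `min_points` in later rounds would admit progressively sparser objects, which is one simple way to read the "easy" to "hard" ordering mentioned in the Figure 5 caption.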