Are Dense Labels Always Necessary for 3D Object Detection from Point Cloud?
Chenqiang Gao, Chuandong Liu, Jun Shu, Fangcen Liu, Jiang Liu, Luyu Yang, Xinbo Gao, Deyu Meng
TL;DR
This work tackles the high cost of densely annotated 3D bounding boxes for point-cloud object detection by proposing SS3D++, a sparsely-supervised framework that trains detectors using only one annotated object per scene. It iteratively improves the detector by mining reliable background and confident missing-annotated instances, using them to generate confident fully-annotated scenes for augmentation, and it manages the resulting pseudo-labels with a multi-criteria curriculum. On KITTI and Waymo, SS3D++ achieves competitive or superior results relative to state-of-the-art weakly/semi-supervised methods, and in several cases approaches or matches fully supervised performance at substantially lower annotation cost (about 5× less on KITTI and about 15× less on Waymo). The method is detector-agnostic and can leverage unlabeled data to further boost performance, offering a practical path toward scalable 3D detection in autonomous driving settings.
Abstract
Current state-of-the-art (SOTA) 3D object detection methods often require a large number of 3D bounding box annotations for training. However, collecting such large-scale densely-supervised datasets is notoriously costly. To reduce the annotation burden, we propose a novel sparsely-annotated framework in which we annotate only one 3D object per scene. Such a sparse annotation strategy significantly reduces annotation effort, but the resulting inexact and incomplete supervision may severely degrade detection performance. To address this issue, we develop the SS3D++ method, which alternately improves 3D detector training and confident fully-annotated scene generation in a unified learning scheme. Using sparse annotations as seeds, we progressively generate confident fully-annotated scenes with a missing-annotated instance mining module and a reliable background mining module. Our proposed method produces competitive results compared with SOTA weakly-supervised methods that use the same or even higher annotation cost. Moreover, compared with SOTA fully-supervised methods, we achieve on-par or even better performance on the KITTI dataset with about 5× less annotation cost, and 90% of their performance on the Waymo dataset with about 15× less annotation cost. Additional unlabeled training scenes can further boost performance.
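The alternating scheme described above can be illustrated with a toy sketch. This is not the SS3D++ implementation: the `Scene` container, the function names, the fixed confidence scores, and the per-round threshold schedule are all hypothetical stand-ins, and the "detector" is a placeholder that only counts the labels it would train on. It shows only the control flow: train on the current labels, mine confident unlabeled instances, add them to the label set, and repeat.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    # Hypothetical container: boxes are opaque IDs, not real 3D boxes.
    labeled: set       # trusted annotations (starts with the single seed box)
    candidates: dict   # box id -> confidence in [0, 1], standing in for detector output


def mine_missing_instances(scene, threshold):
    """Return unlabeled candidate boxes whose confidence clears the threshold."""
    return {
        box for box, score in scene.candidates.items()
        if box not in scene.labeled and score >= threshold
    }


def train_detector(scenes):
    """Placeholder for detector training; returns how many labels it would use."""
    return sum(len(s.labeled) for s in scenes)


def alternate(scenes, thresholds=(0.9, 0.8, 0.7)):
    """Alternate between 'training' and pseudo-label mining, one round per threshold.

    The decreasing threshold schedule is an assumption made for illustration.
    """
    history = []
    for threshold in thresholds:
        history.append(train_detector(scenes))      # step 1: train on current labels
        for scene in scenes:                         # step 2: grow the label set
            scene.labeled |= mine_missing_instances(scene, threshold)
    return history


scenes = [Scene(labeled={"a"}, candidates={"a": 1.0, "b": 0.95, "c": 0.75, "d": 0.3})]
history = alternate(scenes)
```

In a real pipeline the candidate scores would be recomputed by the retrained detector each round, and background regions would also be mined before a scene is treated as fully annotated; both are omitted here to keep the loop readable.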
