Weakly Supervised LiDAR Semantic Segmentation via Scatter Image Annotation
Yilong Chen, Zongyi Xu, Xiaoshui Huang, Ruicheng Zhang, Xinqi Jiang, Xinbo Gao
TL;DR
This work tackles the annotation bottleneck in weakly supervised LiDAR semantic segmentation by introducing Scatter Image Annotation, which propagates sparse manual image annotations into dense 2D and 3D labels using GMFlowNet and SAM. It then presents ScatterNet, a three-branch architecture with an independent Fusion Stream and a perception consistency loss that fuses image and LiDAR information during training while requiring only the LiDAR stream at test time. The approach comes close to fully supervised performance with a tiny fraction of labeled data on nuScenes (0.02%) and SemanticKITTI (0.004%), outperforming prior weakly supervised methods. The method substantially reduces labeling costs for LiDAR semantic segmentation in outdoor, multimodal scenes, although it requires synchronized and spatially aligned camera-LiDAR data.
Abstract
Weakly supervised LiDAR semantic segmentation has made significant strides with limited labeled data. However, most existing methods focus on network training under weak supervision, while efficient annotation strategies remain largely unexplored. To address this gap, we implement LiDAR semantic segmentation using scatter image annotation, integrating an efficient annotation strategy with network training. Specifically, we propose employing scatter images to annotate LiDAR point clouds, combining a pre-trained optical flow estimation network with a foundation image segmentation model to rapidly propagate manual annotations into dense labels for both images and point clouds. Moreover, we propose ScatterNet, a network that includes three pivotal strategies to reduce the performance gap caused by such annotations. First, it uses dense semantic labels as supervision for the image branch, alleviating the modality imbalance between point clouds and images. Second, an intermediate fusion branch is proposed to obtain multimodal texture and structural features. Finally, a perception consistency loss is introduced to determine which information should be fused and which should be discarded during the fusion process. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate that our method requires less than 0.02% of the labeled points to achieve over 95% of the performance of fully supervised methods. Notably, our labeled points are only 5% of those used in the most advanced weakly supervised methods.
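The abstract states that image annotations become labels for point clouds but does not spell out the transfer step. A minimal sketch of one common way to do it follows, assuming a pinhole camera with intrinsics K and a known rigid LiDAR-to-camera transform. The function name, the IGNORE id, and the absence of occlusion handling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

IGNORE = -1  # hypothetical id for points that receive no label


def transfer_image_labels_to_points(points_lidar, label_map, T_cam_from_lidar, K):
    """Assign each LiDAR point the label of the pixel it projects onto.

    points_lidar:     (N, 3) xyz coordinates in the LiDAR frame
    label_map:        (H, W) integer per-pixel class ids (dense image labels)
    T_cam_from_lidar: (4, 4) rigid transform from LiDAR frame to camera frame
    K:                (3, 3) pinhole camera intrinsics
    Returns an (N,) array of class ids, IGNORE where no pixel is hit.
    """
    n = points_lidar.shape[0]
    homogeneous = np.hstack([points_lidar, np.ones((n, 1))])
    cam = (T_cam_from_lidar @ homogeneous.T).T[:, :3]

    labels = np.full(n, IGNORE, dtype=np.int64)

    # Points behind the camera cannot be seen and stay unlabeled.
    in_front = cam[:, 2] > 1e-6
    uvw = (K @ cam[in_front].T).T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(np.int64)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(np.int64)

    # Keep only projections that land inside the image bounds.
    h, w = label_map.shape
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    idx = np.flatnonzero(in_front)[inside]
    labels[idx] = label_map[v[inside], u[inside]]
    return labels
```

The sketch is only meaningful when the camera and LiDAR are synchronized and calibrated, which matches the requirement noted in the TL;DR. A practical pipeline would likely also filter occluded points, which this sketch does not attempt.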
