Weakly Supervised LiDAR Semantic Segmentation via Scatter Image Annotation

Yilong Chen, Zongyi Xu, Xiaoshui Huang, Ruicheng Zhang, Xinqi Jiang, Xinbo Gao

TL;DR

This work tackles the annotation bottleneck in weakly supervised LiDAR semantic segmentation. It introduces Scatter Image Annotation, which uses GMFlowNet and SAM to turn sparse image-based annotations into dense 2D and 3D labels. It then presents ScatterNet, a three-branch architecture with an independent Fusion Stream and a perceptual consistency loss that fuses image and LiDAR information during training while requiring only the LiDAR stream at test time. The approach comes close to fully supervised performance with a tiny fraction of labeled data on nuScenes (0.02%) and SemanticKITTI (0.004%), and it outperforms prior weakly supervised methods. The method reduces labeling costs for LiDAR semantic segmentation in outdoor, multimodal scenes, but it requires synchronized and spatially aligned camera-LiDAR data.

Abstract

Weakly supervised LiDAR semantic segmentation has made significant strides with limited labeled data. However, most existing methods focus on network training under weak supervision, while efficient annotation strategies remain largely unexplored. To close this gap, we implement LiDAR semantic segmentation using scatter image annotation, integrating an efficient annotation strategy with network training. Specifically, we propose employing scatter images to annotate LiDAR point clouds, combining a pre-trained optical flow estimation network with a foundation image segmentation model to rapidly propagate manual annotations into dense labels for both images and point clouds. Moreover, we propose ScatterNet, a network that includes three pivotal strategies to reduce the performance gap caused by such annotations. First, it uses dense semantic labels as supervision for the image branch, alleviating the modality imbalance between point clouds and images. Second, an intermediate fusion branch is proposed to obtain multimodal texture and structural features. Lastly, a perceptual consistency loss determines which information should be fused and which should be discarded during the fusion process. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate that our method requires less than 0.02% of the labeled points to achieve over 95% of the performance of fully supervised methods. Notably, we use only 5% of the labeled points required by the most advanced weakly supervised methods.
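
The abstract describes turning dense image labels into point cloud labels, which relies on camera and LiDAR being calibrated against each other. The following is a minimal sketch of that lifting step under a standard pinhole camera model, not the authors' implementation. The function name `lift_image_labels_to_points`, the intrinsics matrix `K`, the extrinsic transform `T_cam_from_lidar`, and the `ignore_label` value are all assumptions introduced for illustration.

```python
import numpy as np


def lift_image_labels_to_points(points_lidar, label_image, K, T_cam_from_lidar,
                                ignore_label=-1):
    """Give each LiDAR point the label of the pixel it projects onto.

    points_lidar:      (N, 3) xyz coordinates in the LiDAR frame.
    label_image:       (H, W) integer class map (hypothetical dense 2D labels).
    K:                 (3, 3) pinhole intrinsics (assumed known from calibration).
    T_cam_from_lidar:  (4, 4) rigid transform from LiDAR to camera frame.
    Points behind the camera or outside the image keep `ignore_label`.
    """
    n = points_lidar.shape[0]
    labels = np.full(n, ignore_label, dtype=np.int64)

    # Move points into the camera frame using homogeneous coordinates.
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    # Only points in front of the camera can be seen.
    in_front = pts_cam[:, 2] > 1e-6

    # Perspective projection: multiply by K, then divide by depth.
    uvw = (K @ pts_cam[in_front].T).T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(np.int64)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(np.int64)

    # Keep projections that land inside the image bounds.
    h, w = label_image.shape
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    idx = np.flatnonzero(in_front)[inside]
    labels[idx] = label_image[v[inside], u[inside]]
    return labels
```

This sketch ignores occlusion and lens distortion, so a point hidden behind a nearer object would still inherit the label of the visible surface. Points that never fall inside a camera view stay unlabeled, which is one reason such labels remain a weak form of supervision.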

Paper Structure (31 sections, 19 equations, 15 figures, 7 tables)

This paper contains 31 sections, 19 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Comparison of our method with SLidR sautier2022image, IUPC sun2024image, LESS liu2022less, and Contra hou2021exploring on the nuScenes dataset. WS/FS denotes their relative performance compared to the fully supervised method.
  • Figure 2: Mitigating modality imbalances using dense semantic labels. We train the point cloud branch and two variants of the image branch, one with sparse labels and the other with dense ones, separately in a fully supervised manner on the SemanticKITTI behley2019semantickitti and nuScenes behley2019semantickitti datasets. The mIoU is the evaluation metric for semantic segmentation; higher values indicate greater accuracy.
  • Figure 3: Comparison of supervision labels for the image branch. (a) Sparse semantic labels: Projecting point cloud labels onto images. (b) Superpixel labels: Using the SLIC achanta2012slic to group pixels into superpixels. (c) Our method: Using SAM kirillov2023segment to generate dense semantic labels from manual annotations.
  • Figure 4: Weakly supervised annotation strategies. (a) Random annotation: A small subset of points is selected at random for annotation. This approach has difficulty covering infrequent small objects, such as distant fences and cyclists. (b) Scribble annotation: Line-based scribbles, each requiring just two clicks (start and end points), allow quick marking of large geometric areas. However, this method is less effective for smaller objects such as pedestrians. (c) Active annotation: Ground detection separates ground and non-ground points, after which the non-ground points are clustered into multiple point cloud subsets. Each subset is then labeled by annotators. (d) Image semantic annotation: Image semantic segmentation labels are mapped to 3D space for point cloud annotation. This method is more convenient than point-based annotation because it does not require rotating the 3D space. However, dense pixel-based annotation is still time-consuming. (e) Our method (Scatter Image Annotation): Just 1-5 mouse clicks on an image are required to annotate each instance. This sparse, pixel-based approach is easy to interact with and reduces annotation costs.
  • Figure 5: Overview of our method. (a) Scatter Image Annotation: Using GMFlowNet xu2022gmflow and SAM kirillov2023segment, manual annotations are propagated to dense labels. (b) ScatterNet: The network consists of three branches: Camera Stream, Fusion Stream, and LiDAR Stream. The Camera Stream extracts image features, the LiDAR Stream processes point cloud features, and the Fusion Stream merges the features extracted by the Camera and LiDAR Streams. The perceptual consistency loss has two parts: $\mathcal{L}_{per}^{f\to i}$ represents the consistency between the Camera and Fusion Streams, and $\mathcal{L}_{per}^{p\to f}$ represents the consistency between the LiDAR and Fusion Streams.
  • ...and 10 more figures
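
The Figure 2 caption uses mIoU as its evaluation metric. As a hedged illustration of how that metric is commonly computed, the sketch below builds a confusion matrix and averages the per-class intersection over union. The function name `mean_iou` and the `ignore_label` convention are assumptions for this example and are not taken from the paper. Predictions are assumed to lie in the range 0 to `num_classes - 1`.

```python
import numpy as np


def mean_iou(pred, target, num_classes, ignore_label=-1):
    """Mean intersection over union across classes that appear at all.

    pred, target: integer class ids of equal length (one per point or pixel).
    Entries whose target equals `ignore_label` are excluded from scoring.
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()

    # Drop unlabeled entries before counting.
    keep = target != ignore_label
    pred, target = pred[keep], target[keep]

    # Confusion matrix: rows are ground truth, columns are predictions.
    conf = np.bincount(target * num_classes + pred,
                       minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)

    tp = np.diag(conf)                              # intersection per class
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp

    # Skip classes absent from both prediction and ground truth.
    present = union > 0
    if not present.any():
        return float("nan")
    return float((tp[present] / union[present]).mean())
```

Benchmarks typically accumulate the confusion matrix over the whole validation set before taking the ratio, so a per-scan average like this one can differ from reported numbers.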