You Only Click Once: Single Point Weakly Supervised 3D Instance Segmentation for Autonomous Driving

Guangfeng Jiang, Jun Liu, Yongxuan Lv, Yuzhi Wu, Xianfei Li, Wenlong Liao, Tao He, Pai Peng

TL;DR

YoCo targets outdoor LiDAR 3D instance segmentation with minimal annotation by converting single BEV clicks into accurate 3D pseudo labels through a Vision Foundation Model-guided pipeline (VFM-PLG). It further refines labels online via temporal-spatial updates (TSU) and offline IoU-guided enhancement (ILE), integrated into a Mean Teacher training framework. The approach yields state-of-the-art results among weakly supervised methods and can match or exceed fully supervised baselines with only a tiny fraction of labeled data, significantly reducing annotation costs. The framework demonstrates strong generality across different backbones and practical robustness to annotation noise, making it suitable for scalable deployment in autonomous driving.

Abstract

Outdoor LiDAR point cloud 3D instance segmentation is a crucial task in autonomous driving. However, annotating point clouds to train a segmentation model requires laborious human effort. To address this challenge, we propose the YoCo framework, which generates 3D pseudo labels from minimal coarse click annotations in the bird's eye view plane. Producing high-quality pseudo labels from such sparse annotations is a significant challenge. First, YoCo leverages vision foundation models combined with geometric constraints from point clouds to enhance pseudo label generation. Second, a temporal and spatial-based label updating module generates reliable updated labels by leveraging predictions from adjacent frames and the inherent density variation of point clouds (dense near, sparse far). Finally, to further improve label quality, an IoU-guided enhancement module replaces pseudo labels with high-confidence and high-IoU predictions. Experiments on the Waymo dataset demonstrate YoCo's effectiveness and generality: it achieves state-of-the-art performance among weakly supervised methods and surpasses fully supervised Cylinder3D. Additionally, YoCo is suitable for various networks, achieving performance comparable to fully supervised methods with minimal fine-tuning using only 0.8% of the fully labeled data, significantly reducing annotation costs.
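
The IoU-guided enhancement step described in the abstract (replacing pseudo labels with high-confidence, high-IoU predictions) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the representation of masks as sets of point indices, and both thresholds (`score_thresh`, `iou_thresh`) are assumptions chosen for illustration only.

```python
from typing import Dict, List, Set, Tuple


def mask_iou(a: Set[int], b: Set[int]) -> float:
    """Intersection over union of two masks given as sets of point indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def enhance_pseudo_labels(
    pseudo: Dict[int, Set[int]],
    predictions: List[Tuple[Set[int], float]],
    score_thresh: float = 0.8,  # hypothetical confidence threshold
    iou_thresh: float = 0.7,    # hypothetical IoU threshold
) -> Dict[int, Set[int]]:
    """Replace a pseudo label with the best prediction that is both
    confident and strongly overlapping; otherwise keep the pseudo label."""
    updated = {inst_id: set(mask) for inst_id, mask in pseudo.items()}
    for inst_id, label_mask in pseudo.items():
        best_mask, best_iou = None, 0.0
        for pred_mask, score in predictions:
            if score < score_thresh:
                continue  # ignore low-confidence predictions
            iou = mask_iou(label_mask, pred_mask)
            if iou >= iou_thresh and iou > best_iou:
                best_mask, best_iou = pred_mask, iou
        if best_mask is not None:
            updated[inst_id] = set(best_mask)
    return updated
```

In this sketch a pseudo label is only overwritten when a prediction passes both tests, so a confident prediction that barely overlaps the original label leaves that label unchanged.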

Paper Structure

This paper contains 18 sections, 7 equations, 7 figures, 13 tables, 1 algorithm.

Figures (7)

  • Figure 1: 3D Instance Segmentation Performance Comparison. Weakly supervised YoCo, with and without fine-tuning, compared to fully supervised methods. The results show that YoCo outperforms fully supervised Cylinder3D without fine-tuning (0%). Fine-tuning YoCo with 0.8% and 5% labeled data exceeds the fully supervised SparseUnet and state-of-the-art (SOTA) PTv3, respectively.
  • Figure 2: Overview of YoCo Framework. YoCo consists of two main components: (a) pseudo label generation and (b) network training. For pseudo label generation, the VFM-PLG module produces high-quality pseudo labels using both VFMs and geometric constraints. For network training, YoCo adopts the classic Mean Teacher (Tarvainen and Valpola, 2017) structure. The TSU module performs online updates to the pseudo labels by using predictions from adjacent frames in a voxelized manner. Additionally, the ILE module enhances offline pseudo labels by leveraging high-confidence and high-IoU predictions to improve their quality.
  • Figure 3: Our VFM-PLG for the Composite Categories. Comparison of 2D mask results for the same instance with different prompts (colored stars). w/o shows results without DAM, while w shows results with DAM. Using the DAM model yields more consistent and accurate results across different prompts.
  • Figure 4: Overview of VFM-PLG Module. The blue dashed line indicates that if the generated 3D mask does not satisfy geometric constraints, another point is selected as the prompt. GC denotes that the point cloud is processed using geometric constraints.
  • Figure 5: Comparisons of IoU on WOD Validation Dataset. We combine click-level and 2D bounding box-level annotations with various methods to generate pseudo labels for the validation dataset and evaluate them with ground truth.
  • ...and 2 more figures
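
The Figure 2 caption states that network training follows the Mean Teacher structure. In the standard Mean Teacher formulation, the teacher's weights are an exponential moving average (EMA) of the student's weights. The sketch below shows that generic update only; representing weights as a plain dictionary of floats and the `momentum` value are assumptions for illustration, not details taken from the paper.

```python
from typing import Dict


def ema_update(
    teacher: Dict[str, float],
    student: Dict[str, float],
    momentum: float = 0.999,  # hypothetical EMA decay
) -> Dict[str, float]:
    """Move each teacher weight toward the matching student weight:
    teacher <- momentum * teacher + (1 - momentum) * student."""
    for name, student_value in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * student_value
    return teacher


# Usage: the student is trained by gradient descent, and the teacher is
# refreshed after each student step without receiving gradients itself.
teacher_weights = {"w": 1.0, "b": 0.0}
student_weights = {"w": 0.0, "b": 1.0}
ema_update(teacher_weights, student_weights, momentum=0.9)
```

A high momentum makes the teacher change slowly, which is why its predictions tend to be more stable than the student's.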