You Only Click Once: Single Point Weakly Supervised 3D Instance Segmentation for Autonomous Driving
Guangfeng Jiang, Jun Liu, Yongxuan Lv, Yuzhi Wu, Xianfei Li, Wenlong Liao, Tao He, Pai Peng
TL;DR
YoCo targets outdoor LiDAR 3D instance segmentation with minimal annotation by converting single BEV clicks into accurate 3D pseudo labels through a Vision Foundation Model-guided pipeline (VFM-PLG). It further refines labels online via temporal-spatial updates (TSU) and offline IoU-guided enhancement (ILE), integrated into a Mean Teacher training framework. The approach yields state-of-the-art results among weakly supervised methods and can match or exceed fully supervised baselines with only a tiny fraction of labeled data, significantly reducing annotation costs. The framework demonstrates strong generality across different backbones and practical robustness to annotation noise, making it suitable for scalable deployment in autonomous driving.
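The Mean Teacher framework mentioned above keeps a teacher model whose weights are an exponential moving average (EMA) of the student's weights. The following is a minimal sketch of that update on plain dictionaries of floats. The function name `ema_update`, the dictionary representation, and the momentum value are illustrative assumptions, not details taken from YoCo.

```python
def ema_update(teacher, student, momentum=0.999):
    """Return new teacher weights as an EMA of the student weights.

    `teacher` and `student` are dicts mapping parameter names to floats.
    The momentum value is an assumed default, not one reported for YoCo.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Hypothetical single-parameter example.
teacher = {"w": 1.0}
student = {"w": 0.0}
teacher = ema_update(teacher, student, momentum=0.9)
```

A high momentum makes the teacher change slowly, which is the usual reason its predictions are treated as more stable targets than the student's.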
Abstract
Outdoor LiDAR point cloud 3D instance segmentation is a crucial task in autonomous driving. However, annotating point clouds to train a segmentation model requires laborious human effort. To address this challenge, we propose YoCo, a framework that generates 3D pseudo labels from minimal coarse click annotations in the bird's eye view plane. Producing high-quality pseudo labels from such sparse annotations is a significant challenge. First, YoCo leverages vision foundation models combined with geometric constraints from point clouds to enhance pseudo label generation. Second, a temporal and spatial-based label updating module generates reliable updated labels by leveraging predictions from adjacent frames and exploiting the inherent density variation of point clouds (dense near, sparse far). Finally, to further improve label quality, an IoU-guided enhancement module replaces pseudo labels with high-confidence, high-IoU predictions. Experiments on the Waymo dataset demonstrate YoCo's effectiveness and generality: it achieves state-of-the-art performance among weakly supervised methods and surpasses fully supervised Cylinder3D. Additionally, YoCo is suitable for various networks, achieving performance comparable to fully supervised methods with minimal fine-tuning using only 0.8% of the fully labeled data, significantly reducing annotation costs.
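The IoU-guided enhancement step described in the abstract can be illustrated with a small sketch: a pseudo label is replaced by a prediction only when that prediction is both high-confidence and has high IoU with the pseudo label. The function names, the per-point boolean mask representation, and the thresholds `score_thr` and `iou_thr` are assumptions for illustration; the abstract does not specify them.

```python
import numpy as np


def mask_iou(a, b):
    """IoU between two boolean per-point masks of equal length."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def iou_guided_replace(pseudo_masks, pred_masks, pred_scores,
                       score_thr=0.9, iou_thr=0.7):
    """Replace each pseudo mask with its best-matching confident prediction.

    Thresholds are assumed values, not ones reported for YoCo.
    """
    refined = []
    for pseudo in pseudo_masks:
        best_iou, best_idx = 0.0, -1
        for j, (pred, score) in enumerate(zip(pred_masks, pred_scores)):
            if score < score_thr:
                continue  # skip low-confidence predictions
            iou = mask_iou(pseudo, pred)
            if iou > best_iou:
                best_iou, best_idx = iou, j
        if best_idx >= 0 and best_iou >= iou_thr:
            refined.append(pred_masks[best_idx].copy())  # trust the prediction
        else:
            refined.append(pseudo.copy())  # keep the original pseudo label
    return refined


# Hypothetical toy scene with six points and two instances.
pseudo = [np.array([1, 1, 1, 0, 0, 0], dtype=bool),
          np.array([0, 0, 0, 0, 1, 0], dtype=bool)]
preds = [np.array([1, 1, 1, 1, 0, 0], dtype=bool),
         np.array([0, 0, 0, 0, 1, 1], dtype=bool)]
scores = [0.95, 0.99]
refined = iou_guided_replace(pseudo, preds, scores)
```

In this toy case the first pseudo label overlaps its prediction with IoU 0.75 and is replaced, while the second has IoU 0.5 and is kept, so a confident prediction alone is not enough to overwrite a label.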
