ALISE: Annotation-Free LiDAR Instance Segmentation for Autonomous Driving
Yongxuan Lyu, Guangfeng Jiang, Hongsi Liu, Jun Liu
TL;DR
ALISE tackles the high cost of annotating outdoor LiDAR point clouds for 3D instance segmentation by using Vision Foundation Models (VFMs) to generate initial pseudo-labels from images, then applying spatio-temporal refinement and several forms of semantic supervision. The approach preserves VFM semantic distributions, refines pseudo-labels both offline and online, and introduces a prototype-based contrastive learning framework that learns discriminative 3D features without labels. Key contributions include a UPG pseudo-label generator, CVIM cross-view merging, VFM-based distillation (VPD), and a dual-frame prototype contrastive loss (PCL). Together they achieve state-of-the-art performance among unsupervised methods and results competitive with weakly supervised baselines. The annotation-free pipeline removes the need for manual labels while maintaining strong performance on Waymo and nuScenes, which matters for scalable autonomous-driving perception.
Abstract
The manual annotation of outdoor LiDAR point clouds for instance segmentation is extremely costly and time-consuming. Current methods attempt to reduce this burden but still rely on some form of human labeling. To eliminate this dependency completely, we introduce ALISE, a novel framework that performs LiDAR instance segmentation without any annotations. The central challenge is to generate high-quality pseudo-labels in a fully unsupervised manner. Our approach starts by employing Vision Foundation Models (VFMs), guided by text and images, to produce initial pseudo-labels. We then refine these labels through a dedicated spatio-temporal voting module, which combines 2D and 3D semantics for both offline and online optimization. To improve feature learning, we further introduce two forms of semantic supervision: a set of 2D prior-based losses that inject visual knowledge into the 3D network, and a novel prototype-based contrastive loss that builds a discriminative feature space by exploiting 3D semantic consistency. This design yields significant performance gains, establishing a new state-of-the-art for unsupervised 3D instance segmentation. Remarkably, our approach even outperforms MWSIS, a method that operates with supervision from ground-truth (GT) 2D bounding boxes, by 2.53 percentage points in mAP (50.95% vs. 48.42%).
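To make the idea of a prototype-based contrastive loss concrete, the following is a minimal sketch of a generic InfoNCE-style formulation. It is not the ALISE PCL: the abstract does not give the exact form of the loss, and the dual-frame aspect is omitted here. The function name, the `temperature` parameter, and the use of per-class mean features as prototypes are all assumptions made for illustration.

```python
import math

def _normalize(v):
    # Scale a vector to unit L2 norm so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def prototype_contrastive_loss(features, labels, temperature=0.1):
    """Generic prototype-based contrastive loss (illustrative assumption only).

    features: list of per-point feature vectors (non-zero).
    labels:   list of pseudo-label ids, one per point.
    """
    feats = [_normalize(f) for f in features]
    # One prototype per pseudo-class: the normalized mean of its member features.
    protos = {}
    for c in sorted(set(labels)):
        members = [f for f, l in zip(feats, labels) if l == c]
        mean = [sum(col) / len(members) for col in zip(*members)]
        protos[c] = _normalize(mean)
    total = 0.0
    for f, l in zip(feats, labels):
        # Temperature-scaled similarity of this point to every prototype.
        logits = {c: sum(a * b for a, b in zip(f, p)) / temperature
                  for c, p in protos.items()}
        # Numerically stable log-sum-exp over all prototypes.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Cross-entropy term: pull the point toward its own prototype.
        total += log_denom - logits[l]
    return total / len(feats)

# Toy usage: two tight clusters, scored with consistent and with mixed labels.
toy_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clean_loss = prototype_contrastive_loss(toy_feats, [0, 0, 1, 1])
noisy_loss = prototype_contrastive_loss(toy_feats, [0, 1, 0, 1])
```

In this toy case the loss is lower when the pseudo-labels agree with the feature clusters than when they are mixed, which is the property such a loss relies on to shape a discriminative feature space.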
