ALISE: Annotation-Free LiDAR Instance Segmentation for Autonomous Driving

Yongxuan Lyu, Guangfeng Jiang, Hongsi Liu, Jun Liu

TL;DR

ALISE tackles the high cost of annotating outdoor LiDAR for 3D instance segmentation by deploying Vision Foundation Models to generate initial pseudo-labels from images, followed by a robust spatio-temporal refinement and multi-faceted semantic supervision. The approach preserves VFM semantic distributions, employs offline and online pseudo-label refinement, and introduces a prototype-based contrastive learning framework to learn discriminative 3D features without labels. Key contributions include a UPG pseudo-label generator, CVIM cross-view merging, VFM-based distillation (VPD), and a dual-frame prototype contrastive loss (PCL), achieving state-of-the-art performance among unsupervised methods and competitive results compared to weakly supervised baselines. This annotation-free pipeline significantly reduces labeling costs while maintaining strong performance on Waymo and nuScenes, with practical implications for scalable autonomous-driving perception.

Abstract

The manual annotation of outdoor LiDAR point clouds for instance segmentation is extremely costly and time-consuming. Current methods attempt to reduce this burden but still rely on some form of human labeling. To completely eliminate this dependency, we introduce ALISE, a novel framework that performs LiDAR instance segmentation without any annotations. The central challenge is to generate high-quality pseudo-labels in a fully unsupervised manner. Our approach starts by employing Vision Foundation Models (VFMs), guided by text and images, to produce initial pseudo-labels. We then refine these labels through a dedicated spatio-temporal voting module, which combines 2D and 3D semantics for both offline and online optimization. To achieve superior feature learning, we further introduce two forms of semantic supervision: a set of 2D prior-based losses that inject visual knowledge into the 3D network, and a novel prototype-based contrastive loss that builds a discriminative feature space by exploiting 3D semantic consistency. This comprehensive design results in significant performance gains, establishing a new state of the art for unsupervised 3D instance segmentation. Remarkably, our approach even outperforms MWSIS, a method that operates with supervision from ground-truth (GT) 2D bounding boxes, by 2.53 percentage points in mAP (50.95% vs. 48.42%).
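The abstract names a prototype-based contrastive loss but does not give its formula. As a rough illustration only, the sketch below shows a generic InfoNCE-style prototype loss: each pseudo-labeled point feature is pulled toward the mean feature (prototype) of its own class and pushed away from the other prototypes. The function names, the temperature value, and the single-frame setup are assumptions made for this example and are not taken from the paper, whose PCL is described as dual-frame.

```python
import math
from collections import defaultdict


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def class_prototypes(features, labels):
    """Hypothetical helper: one prototype per pseudo-label, computed as the
    re-normalized mean of the L2-normalized features carrying that label."""
    sums = {}
    counts = defaultdict(int)
    for f, y in zip(features, labels):
        f = l2_normalize(f)
        if y not in sums:
            sums[y] = [0.0] * len(f)
        sums[y] = [s + x for s, x in zip(sums[y], f)]
        counts[y] += 1
    return {y: l2_normalize([s / counts[y] for s in sums[y]]) for y in sums}


def prototype_contrastive_loss(features, labels, prototypes, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's exact PCL):
    mean cross-entropy over cosine similarities between features and prototypes."""
    classes = sorted(prototypes)
    total = 0.0
    for f, y in zip(features, labels):
        f = l2_normalize(f)
        logits = [
            sum(a * b for a, b in zip(f, prototypes[c])) / temperature
            for c in classes
        ]
        # Numerically stable log-sum-exp over all prototypes.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[classes.index(y)]
    return total / len(features)


# Toy 2D features: two well-separated clusters with pseudo-labels 0 and 1.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
prototypes = class_prototypes(features, labels)

loss_consistent = prototype_contrastive_loss(features, labels, prototypes)
# Swapping the labels while keeping the prototypes makes every point "wrong".
loss_swapped = prototype_contrastive_loss(features, [1, 1, 0, 0], prototypes)
print(loss_consistent < loss_swapped)
```

In this toy setting the loss is near zero when features agree with their pseudo-labels and large when they do not, which is the property a discriminative feature space relies on. A real training setup would compute this on network outputs with an autodiff framework rather than plain Python lists.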

Paper Structure

This paper contains 27 sections, 15 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: Performance comparison of ALISE against methods with different supervision types. Our label-free method ALISE (at 0% GT) surpasses weakly supervised baselines. When fine-tuned with a small amount of GT labels, ALISE consistently outperforms the fully supervised baseline.
  • Figure 2: Illustration of the UPG module and the OFR module. Blue points represent the current frame, while orange points represent the adjacent frame. Different classes are indicated by circles and triangles.
  • Figure 3: Illustration of the ONR module, the VPD module, and the PCL module in the training stage.
  • Figure 4: Visualization of rider-bicycle instance merging.
  • Figure 5: Visualization of cross-view instance merging (CVIM).
  • ...and 2 more figures