LeAP: Consistent multi-domain 3D labeling using Foundation Models

Simon Gebraad, Andras Palffy, Holger Caesar

TL;DR

LeAP addresses the scarcity of 3D semantic labels by using open-vocabulary 2D foundation models to generate soft, per-pixel labels from unlabeled image–LiDAR pairs, then propagating these labels to 3D with a Bayesian voxel update and a 3D Consistency Network (3D-CN). The pipeline produces pseudo-labels for any set of classes and is demonstrated on automotive and agricultural data; models adapted to new domains with these labels gain up to 34.2 mIoU. A 3D backbone trained on high-confidence camera-based pseudo-labels further refines the results, indicating that multi-modal fusion can improve 3D labeling where labeled data is scarce. The work thereby extends 3D semantic labeling to diverse domains and sensor configurations without manual annotation. Noted limitations are prompt reliability and possible reinforcement of errors, with better prompts and dynamic scene handling left to future work.

Abstract

Availability of datasets is a strong driver for research on 3D semantic understanding, and whilst obtaining unlabeled 3D point cloud data is straightforward, manually annotating this data with semantic labels is time-consuming and costly. Recently, Vision Foundation Models (VFMs) have enabled open-set semantic segmentation on camera images, potentially aiding automatic labeling. However, VFMs for 3D data have been limited to adaptations of 2D models, which can introduce inconsistencies to 3D labels. This work introduces Label Any Pointcloud (LeAP), leveraging 2D VFMs to automatically label 3D data with any set of classes in any kind of application whilst ensuring label consistency. Using a Bayesian update, point labels are combined into voxels to improve spatio-temporal consistency. A novel 3D Consistency Network (3D-CN) exploits 3D information to further improve label quality. Through various experiments, we show that our method can generate high-quality 3D semantic labels across diverse fields without any manual labeling. Further, models adapted to new domains using our labels show up to a 34.2 mIoU increase in semantic segmentation tasks.
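
The Bayesian combination of point labels into voxels mentioned in the abstract can be illustrated with a minimal sketch. The paper's actual update equations are not reproduced in this summary, so the code below assumes a standard recursive categorical update with a uniform prior and conditionally independent observations, accumulated in log space. The function names, the dictionary-based voxel grid, and the `voxel_size` default are all hypothetical choices made for illustration.

```python
import numpy as np


def voxel_keys(points, voxel_size):
    """Map each 3D point to an integer voxel coordinate (as a hashable tuple)."""
    idx = np.floor(np.asarray(points) / voxel_size).astype(np.int64)
    return [tuple(row) for row in idx.tolist()]


def bayesian_voxel_update(log_belief, points, point_probs, voxel_size=0.2, eps=1e-6):
    """Fuse per-point soft labels into per-voxel log-beliefs.

    log_belief  : dict, voxel key -> unnormalised log-probability vector (C,)
    points      : (N, 3) point coordinates in a common world frame
    point_probs : (N, C) soft labels, one categorical distribution per point

    Assumption: observations are independent given the class, so the
    posterior is the product of observation likelihoods (a sum in log space).
    """
    log_obs = np.log(np.clip(point_probs, eps, 1.0))
    for key, obs in zip(voxel_keys(points, voxel_size), log_obs):
        prior = log_belief.get(key)
        # A missing voxel corresponds to a uniform prior, which adds a constant.
        log_belief[key] = obs.copy() if prior is None else prior + obs
    return log_belief


def voxel_posterior(log_vec):
    """Normalise a log-belief vector into class probabilities."""
    p = np.exp(log_vec - np.max(log_vec))  # subtract max for numerical stability
    return p / p.sum()


# Two points fall in the same voxel, a third in a different one.
belief = bayesian_voxel_update(
    {},
    points=np.array([[0.05, 0.05, 0.05], [0.10, 0.10, 0.10], [1.10, 0.0, 0.0]]),
    point_probs=np.array([[0.6, 0.4], [0.7, 0.3], [0.1, 0.9]]),
)
posterior = voxel_posterior(belief[(0, 0, 0)])
```

Calling `bayesian_voxel_update` again with the points of a later frame (transformed into the same world frame) keeps refining the same voxels, which is how repeated observations over time can sharpen an initially uncertain label.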

Paper Structure

This paper contains 17 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of our LeAP automatic labeling method. (A) Taking only paired image-LiDAR data as input, foundation models are used to generate image labels for any set of classes in any application. (B) A Bayesian voxel update and (C) a novel 3D Consistency Network (3D-CN) improve label consistency, resulting in high-quality pseudo-labels.
  • Figure 2: The process of generating 2D pseudo-labels. Using unlabeled images and a list of classes, we use Grounding DINO (Liu et al., 2023) features to obtain regions with soft labels. Segment Anything (Kirillov et al., 2023) converts these to detailed masks and allows us to obtain per-pixel soft labels.
  • Figure 3: The process of generating 3D pseudo-labels. Point clouds are painted with image-based labels and probabilistically accumulated in a voxel grid, ensuring spatio-temporal consistency. The universal voxel representation enables fusion with our 3D-CN.
  • Figure 4: Qualitative results of our pseudo-labeling pipeline. Frames of SemanticKITTI (Behley et al., 2019) and our own AgriUAV drone dataset are shown on the top and bottom respectively. Colours correspond to the classes in Table \ref{table:domain_adapt}. Black points are unlabeled, e.g. when outside the camera frame or of an unsupported class. Voxels enable us to label vastly more points than 2D-3D projection alone. The bottom row clearly shows how pre-trained models from different domains often fail to transfer to new domains and how 3D-CN can improve spatial consistency.
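
The "painting" step described in the Figure 3 caption, together with the unlabeled black points mentioned for Figure 4, can be sketched as a plain pinhole projection. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes an undistorted pinhole camera with intrinsics `K`, a known LiDAR-to-camera transform `T_cam_lidar`, and a per-pixel soft-label image of shape (H, W, C). The function name `paint_points` and all parameter names are hypothetical.

```python
import numpy as np


def paint_points(points_lidar, soft_labels, K, T_cam_lidar):
    """Assign each LiDAR point the soft label of the pixel it projects to.

    points_lidar : (N, 3) points in the LiDAR frame
    soft_labels  : (H, W, C) per-pixel class probabilities
    K            : (3, 3) pinhole intrinsics (assumed, no lens distortion)
    T_cam_lidar  : (4, 4) rigid transform from LiDAR frame to camera frame

    Returns (probs, valid): probs is (N, C); valid is an (N,) boolean mask
    that is False for points behind the camera or outside the image.
    """
    H, W, C = soft_labels.shape
    n = points_lidar.shape[0]

    # Transform points into the camera frame using homogeneous coordinates.
    homo = np.hstack([points_lidar, np.ones((n, 1))])
    cam = (T_cam_lidar @ homo.T).T[:, :3]

    # Only points in front of the camera can be projected meaningfully.
    in_front = cam[:, 2] > 1e-6
    depth = np.where(in_front, cam[:, 2], 1.0)  # avoid division by zero

    # Perspective projection to integer pixel coordinates.
    uvw = (K @ cam.T).T
    u = np.floor(uvw[:, 0] / depth).astype(np.int64)
    v = np.floor(uvw[:, 1] / depth).astype(np.int64)

    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    # Points without a valid pixel stay unlabeled (all-zero probabilities).
    probs = np.zeros((n, C))
    probs[valid] = soft_labels[v[valid], u[valid]]
    return probs, valid


# Toy example: left half of the image is mostly class 0, right half class 1.
labels = np.zeros((100, 100, 2))
labels[:, :50] = [0.9, 0.1]
labels[:, 50:] = [0.2, 0.8]
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 0.0, 10.0], [-1.0, 0.0, 10.0], [0.0, 0.0, -5.0], [10.0, 0.0, 10.0]])
probs, valid = paint_points(pts, labels, K, np.eye(4))
```

Only the points with `valid == True` carry a label after this step. Accumulating them in a shared voxel grid is what lets points outside the camera frame inherit labels from other frames, in line with the caption's remark that voxels label far more points than 2D-3D projection alone.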