LOSC: LiDAR Open-voc Segmentation Consolidator

Nermin Samet, Gilles Puy, Renaud Marlet

Abstract

We study the use of image-based Vision-Language Models (VLMs) for open-vocabulary segmentation of lidar scans in driving settings. Classically, image semantics can be back-projected onto 3D point clouds. Yet the resulting point labels are noisy and sparse. We consolidate these labels to enforce both spatio-temporal consistency and robustness to image-level augmentations. We then train a 3D network on these refined labels. This simple method, called LOSC, outperforms the SOTA in zero-shot open-vocabulary semantic and panoptic segmentation on both nuScenes and SemanticKITTI by significant margins. Code is available at https://github.com/valeoai/LOSC.
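The back-projection mentioned in the abstract can be illustrated with a minimal sketch. It assumes a simple pinhole camera, points already expressed in the camera frame, and a per-pixel label map produced by some 2D segmenter; the function name `backproject_labels` and the parameters `fx`, `fy`, `cx`, `cy` are hypothetical and are not taken from the LOSC code base.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_labels(
    points_cam: Sequence[Point],
    label_map: Sequence[Sequence[int]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Optional[int]]:
    """Give each 3D point the label of the pixel it projects to.

    Assumptions (illustrative only): pinhole intrinsics (fx, fy, cx, cy),
    points given in the camera frame with z pointing forward, and
    label_map indexed as label_map[row][column].
    Points behind the camera or outside the image receive None.
    """
    height = len(label_map)
    width = len(label_map[0]) if height else 0
    labels: List[Optional[int]] = []
    for x, y, z in points_cam:
        if z <= 0.0:
            # Behind the camera: not visible, so no label.
            labels.append(None)
            continue
        # Perspective projection, then floor to an integer pixel index.
        u = math.floor(fx * x / z + cx)  # column
        v = math.floor(fy * y / z + cy)  # row
        if 0 <= u < width and 0 <= v < height:
            labels.append(label_map[v][u])
        else:
            labels.append(None)
    return labels
```

Points that receive `None` here are one source of the sparsity the abstract refers to: a lidar point only gets a label if it falls inside the field of view of a camera.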

Paper Structure

This paper contains 26 sections, 4 equations, 4 figures, 14 tables.

Figures (4)

  • Figure 1: Performance of LOSC compared to SOTA on zero-shot open-voc segmentation.
  • Figure 2: Overall pipeline of LOSC. The images are first segmented using an open-vocabulary segmentation model. The semantic 2D labels are then back-projected onto the lidar points. These 3D labels are refined in three consolidation steps: the first uses label consistency across several image augmentations, the second uses time consistency, and the last combines the best labels from the previous steps. These labels are used to fine-tune a 3D network. Finally, a few steps of self-training are used to improve the results.
  • Figure 3: Qualitative results of semantic segmentation from the validation sets of nuScenes and SemanticKITTI. The color code used to represent each class is provided in Supplementary Material. A typical error with LOSC on both datasets is a confusion between different types of flat surfaces such as road/driveable surface, sidewalk and terrain. In SemanticKITTI, trunks are also systematically included in vegetation rather than considered as a separate class. These observations are consistent with the quantitative class-wise results provided in Supplementary Material.
  • Figure 4: Color code used to represent each class on nuScenes (top) and SemanticKITTI (bottom) in Figure 3.
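
The first consolidation step in the Figure 2 caption relies on label consistency across image augmentations. The sketch below shows one plausible way to express such a filter; it is not the authors' implementation. It assumes each augmentation yields one label (or `None`) per lidar point, and it keeps a point's label only when a sufficient fraction of augmentations agree. The function name and the `min_agreement` threshold are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence


def consolidate_across_augmentations(
    labels_per_aug: Sequence[Sequence[Optional[int]]],
    min_agreement: float = 1.0,
) -> List[Optional[int]]:
    """Keep a point's label only if enough augmentations agree on it.

    labels_per_aug[a][i] is the label of point i under augmentation a,
    or None if the point received no label. Agreement is measured over
    all augmentations, so missing votes count against a label.
    This is an illustrative rule, not the criterion used in the paper.
    """
    if not labels_per_aug:
        return []
    num_augs = len(labels_per_aug)
    num_points = len(labels_per_aug[0])
    consolidated: List[Optional[int]] = []
    for i in range(num_points):
        votes = [aug[i] for aug in labels_per_aug if aug[i] is not None]
        if not votes:
            consolidated.append(None)
            continue
        label, count = Counter(votes).most_common(1)[0]
        if count / num_augs >= min_agreement:
            consolidated.append(label)
        else:
            # Inconsistent point: drop the label rather than guess.
            consolidated.append(None)
    return consolidated
```

With `min_agreement=1.0`, only points labeled identically under every augmentation survive, which trades label density for reliability; lowering the threshold keeps more points at the cost of noisier labels.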