
IGLOSS: Image Generation for Lidar Open-vocabulary Semantic Segmentation

Nermin Samet, Gilles Puy, Renaud Marlet

Abstract

This paper presents a new method for zero-shot open-vocabulary semantic segmentation (OVSS) of 3D automotive lidar data. To circumvent the well-known image-text modality gap intrinsic to approaches based on Vision Language Models (VLMs) such as CLIP, our method instead relies on text-to-image generation to create prototype images. Given a 3D network distilled from a 2D Vision Foundation Model (VFM), we then label a point cloud by matching 3D point features with the 2D image features of these prototypes. Our method achieves state-of-the-art OVSS results on nuScenes and SemanticKITTI. Code, pre-trained models, and generated images are available at https://github.com/valeoai/IGLOSS.

Paper Structure

This paper contains 42 sections, 7 figures, and 12 tables.

Figures (7)

  • Figure 1: IGLOSS vs. typical 3D OVSS methods. Closed-set 3D OVSS methods (top) train a 3D network on pseudo-labels provided by a 2D OVSS method. Typical 3D OVSS methods (middle) distill a 2D VLM into a 3D network, leveraging a mask generator at training time; inference consists in matching text features to 3D features on a nearest-prompt basis (NN). Our method, IGLOSS (bottom), exploits a 2D VFM distilled into a 3D network, making it a 3D VFM. Inference consists in generating prototype images from text prompts and matching the 2D features of these images to 3D features, using test-time adaptation with multinomial logistic regression (LR).
  • Figure 2: Examples of segment retrieval for arbitrary classes. As IGLOSS is open-vocabulary (OV), it can segment any class defined by a free-text prompt. We present OV results (points reprojected onto images) for the usually-ignored classes wheelchair, stroller, and crosswalk, which are not included in the 16 official nuScenes classes [nuscenes]. Crosswalks are not even identified in nuScenes and are simply labeled as driveable surface. Although crosswalk points lie geometrically on the road plane, their features nonetheless identify them specifically, because the 2D-3D distillation takes lidar intensity into account [scalr].
  • Figure 3: Examples of generated images (see the experiments section) with ChatGPT (top), Gemini (middle), and Flux (bottom). Web images are not shown due to copyright restrictions.
  • Figure 4: IGLOSS inference pipeline. Given classes to segment, we first generate prototype images using an off-the-shelf image generator, e.g., ChatGPT. These images are fed into a 2D VFM, e.g., DINOv2, and representative 2D image features are extracted for each class. Using these 2D features, we fit a multinomial logistic regression model (LR). This model is then used to classify the 3D features of lidar points, which are aligned by design with their 2D counterparts. A minimal code sketch of this pipeline is given after the figure list.
  • Figure 5: Number of prototype images. (a) We use an increasing number of randomly generated ChatGPT images per prompt. Performance (on the nuScenes val set) plateaus at two. (b) Five random experiments with two images per prompt show small variance.
  • ...and 2 more figures
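
The pipeline in Figure 4 reduces, at inference time, to fitting a small classifier on 2D prototype features and applying it to 3D point features. The sketch below illustrates that logic only; the feature extraction is stubbed with random arrays, and all names, shapes, and the two-images-per-class setting are illustrative assumptions (in practice, 2D features would come from a VFM such as DINOv2 run on the generated prototype images, and 3D features from the distilled lidar network). This is not the authors' implementation.

```python
# Minimal sketch of an IGLOSS-style inference pipeline (Figure 4), assuming
# 2D prototype features and 3D point features live in the same aligned space.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

CLASSES = ["car", "pedestrian", "crosswalk"]  # any free-text prompts
FEAT_DIM = 768          # assumed feature dimension (e.g., a ViT-B backbone)
IMAGES_PER_CLASS = 2    # the paper reports performance plateauing at two

# 1) Placeholder for: representative 2D features extracted by a VFM from
#    the generated prototype images, one feature vector per image here.
proto_feats = rng.normal(size=(len(CLASSES) * IMAGES_PER_CLASS, FEAT_DIM))
proto_labels = np.repeat(np.arange(len(CLASSES)), IMAGES_PER_CLASS)

# 2) Fit a multinomial logistic regression on the 2D prototype features
#    (the test-time adaptation "LR" step of Figures 1 and 4).
clf = LogisticRegression(max_iter=1000).fit(proto_feats, proto_labels)

# 3) Placeholder for: 3D point features from the lidar network distilled
#    from the 2D VFM, aligned by design with the 2D feature space.
point_feats = rng.normal(size=(10_000, FEAT_DIM))

# 4) Label each lidar point by classifying its feature vector.
point_classes = clf.predict(point_feats)
print({c: int((point_classes == i).sum()) for i, c in enumerate(CLASSES)})
```

Fitting a logistic regression over all prototype features, rather than matching each point to its nearest prototype, is what distinguishes the LR step from the nearest-prompt (NN) matching used by typical 3D OVSS methods in Figure 1.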