Zero-Shot 4D Lidar Panoptic Segmentation

Yushan Zhang, Aljoša Ošep, Laura Leal-Taixé, Tim Meinhardt

TL;DR

This work tackles open-ended, zero-shot 4D Lidar understanding by introducing SAL-4D, a pipeline that distills Video Object Segmentation and Vision-Language foundation models into Lidar data. It constructs temporally coherent pseudo-labels via a Track--Lift--Flatten engine and cross-window association, enabling a 4D segmentation model to learn without labeled 4D data. SAL-4D delivers strong zero-shot performance, significantly outperforming single-scan baselines and narrowing the gap to supervised methods on SemanticKITTI and Panoptic nuScenes, while also enabling recognition of objects outside fixed vocabularies through CLIP tokens. The approach demonstrates that temporal coherence and multi-modal distillation can unlock zero-shot 4D Lidar panoptic segmentation, with practical implications for embodied navigation and semantic mapping.

Abstract

Zero-shot 4D segmentation and recognition of arbitrary objects in Lidar is crucial for embodied navigation, with applications ranging from streaming perception to semantic mapping and localization. However, the primary challenge in advancing research and developing generalized, versatile methods for spatio-temporal scene understanding in Lidar lies in the scarcity of datasets that provide the necessary diversity and scale of annotations. To overcome these challenges, we propose SAL-4D (Segment Anything in Lidar--4D), a method that uses multi-modal robotic sensor setups as a bridge to distill recent developments in Video Object Segmentation (VOS), in conjunction with off-the-shelf Vision-Language foundation models, to Lidar. We use VOS models to pseudo-label tracklets in short video sequences, annotate these tracklets with sequence-level CLIP tokens, and lift them to the 4D Lidar space using calibrated multi-modal sensor setups to distill them into our SAL-4D model. Thanks to temporally consistent predictions, we outperform prior art in 3D Zero-Shot Lidar Panoptic Segmentation (LPS) by over $5$ PQ, and unlock Zero-Shot 4D-LPS.
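
To make the role of the per-tracklet CLIP tokens concrete, the following is a minimal sketch of zero-shot recognition by text prompt: each track token is compared to a set of prompt embeddings by cosine similarity, and the best-scoring prompt is taken as the label. The function names, the embedding dimension, and the random vectors standing in for real CLIP text and track embeddings are all assumptions made for illustration; the abstract does not specify how SAL-4D scores prompts.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def zero_shot_classify(track_tokens, text_embeddings, class_names):
    """Assign each track the prompt whose embedding is most similar to its token.

    track_tokens:    (num_tracks, dim) array, one CLIP-like token per track.
    text_embeddings: (num_prompts, dim) array, one embedding per text prompt.
    """
    sims = l2_normalize(track_tokens) @ l2_normalize(text_embeddings).T
    best = sims.argmax(axis=1)
    return [class_names[i] for i in best], sims


# Placeholder data (assumption): random vectors instead of real CLIP outputs.
rng = np.random.default_rng(0)
class_names = ["car", "bicycle rider", "advertising stand"]
text_embeddings = rng.normal(size=(3, 16))

# Two toy tracks: noisy copies of the "advertising stand" and "car" prompts.
track_tokens = text_embeddings[[2, 0]] + 0.05 * rng.normal(size=(2, 16))

labels, sims = zero_shot_classify(track_tokens, text_embeddings, class_names)
print(labels)
```

Because the vocabulary is only a list of prompts supplied at test time, changing `class_names` and their embeddings changes the label set without retraining, which is the property the abstract refers to as zero-shot recognition.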

Paper Structure

This paper contains 38 sections, 11 equations, 7 figures, 14 tables, 3 algorithms.

Figures (7)

  • Figure 1: SAL-4D pseudo-label engine. We first independently pseudo-label overlapping sliding windows (\ref{fig:sal4d-window}). We track and segment objects in the video using SAM 2 (ravi2024sam), generate their semantic features using CLIP, and lift labels from images to 4D Lidar space. Finally, we "flatten" masklets to obtain a unique, non-overlapping set of masklets in Lidar for each temporal window. We associate masklets across windows via linear assignment (LA) to obtain pseudo-labels for full sequences and average their semantic features (\ref{fig:sal4d-cross-window}).
  • Figure 2: SAL-4D model segments individual spatio-temporal instances in 4D Lidar sequences and predicts per-track CLIP tokens that foster test-time zero-shot recognition via text prompts.
  • Figure 3: Qualitative results. We compare our 4D pseudo-labels (obtained over windows of $2$ and $8$ frames) to GT labels and to single-scan labels. In contrast to GT, our automatically generated labels cover both thing and stuff classes. The temporal coherence of the labels improves with larger window sizes.
  • Figure 4: Qualitative results on SemanticKITTI. We show ground-truth (GT) labels (left column), our pseudo-labels (middle column), and SAL-4D results (right column). We show semantic predictions (first row) and instances (second row). Our pseudo-labels cover only the camera-visible portion of the sequence (middle column). In contrast to GT labels, which annotate instances only for a subset of thing classes (left column), our pseudo-label instances are not limited to those classes. Our trained SAL-4D thus learns to densely segment all classes in space and time (right column). Importantly, pseudo-labels do not provide semantic labels, only CLIP tokens. For visualization, we prompt individual instances with prompts that conform to the SemanticKITTI class vocabulary. Best seen zoomed.
  • Figure 5: Prompt examples. We visualize the output of our model (objects highlighted in orange) for four different prompts: two canonical classes, car and bicycle rider, and two "arbitrary" objects, advertising stand and electric street box. All are segmented correctly, including stationary and moving instances. Remarkably, all three different types of advertising stand and both instances of electric street box are correctly segmented. We provide images for reference; images are not used as input to our model. Best seen in color, zoomed.
  • ...and 2 more figures
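
The cross-window association described in the Figure 1 caption can be sketched as a linear assignment problem. The caption states only that masklets are associated across windows via linear assignment; the matching score used here (point-wise IoU on the frames shared by two overlapping windows), the `min_iou` threshold, and all function names are assumptions for illustration rather than the paper's implementation. The sketch uses `scipy.optimize.linear_sum_assignment` to solve the assignment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou_matrix(ids_a, ids_b):
    """Pairwise IoU between masklets of two windows on their shared points.

    ids_a, ids_b: 1D integer arrays of equal length holding the masklet ID
    of each shared Lidar point in window A and window B (-1 = unlabeled).
    """
    a_ids = np.unique(ids_a[ids_a >= 0])
    b_ids = np.unique(ids_b[ids_b >= 0])
    iou = np.zeros((len(a_ids), len(b_ids)))
    for i, a in enumerate(a_ids):
        mask_a = ids_a == a
        for j, b in enumerate(b_ids):
            mask_b = ids_b == b
            inter = np.logical_and(mask_a, mask_b).sum()
            union = np.logical_or(mask_a, mask_b).sum()
            iou[i, j] = inter / union
    return a_ids, b_ids, iou


def associate_windows(ids_a, ids_b, min_iou=0.5):
    """Map window-B masklet IDs to window-A IDs via linear assignment.

    Matches below min_iou (a hypothetical threshold) are dropped, so those
    window-B masklets would start new tracks.
    """
    a_ids, b_ids, iou = iou_matrix(ids_a, ids_b)
    rows, cols = linear_sum_assignment(iou, maximize=True)
    return {
        int(b_ids[c]): int(a_ids[r])
        for r, c in zip(rows, cols)
        if iou[r, c] >= min_iou
    }


# Toy overlap of 8 points labeled independently in two windows.
ids_a = np.array([0, 0, 0, 1, 1, 2, 2, -1])
ids_b = np.array([5, 5, 5, 7, 7, -1, 9, 9])

mapping = associate_windows(ids_a, ids_b)
print(mapping)  # {5: 0, 7: 1}
```

In this toy case masklets 5 and 7 of the second window overlap perfectly with masklets 0 and 1 of the first, while masklet 9 overlaps masklet 2 with an IoU of only 1/3 and is therefore left unmatched. Once IDs are linked across windows, per-window semantic features of the same track can be averaged, as the caption describes.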