Zero-Shot 4D Lidar Panoptic Segmentation

Yushan Zhang; Aljoša Ošep; Laura Leal-Taixé; Tim Meinhardt

TL;DR

This work tackles open-ended, zero-shot 4D Lidar understanding by introducing SAL-4D, a pipeline that distills Video Object Segmentation and Vision-Language foundation models into Lidar data. It constructs temporally coherent pseudo-labels via a Track--Lift--Flatten engine and cross-window association, enabling a 4D segmentation model to learn without labeled 4D data. SAL-4D delivers strong zero-shot performance, significantly outperforming single-scan baselines and narrowing the gap to supervised methods on SemanticKITTI and Panoptic nuScenes, while also enabling recognition of objects outside fixed vocabularies through CLIP tokens. The approach demonstrates that temporal coherence and multi-modal distillation can unlock zero-shot 4D Lidar panoptic segmentation, with practical implications for embodied navigation and semantic mapping.

Abstract

Zero-shot 4D segmentation and recognition of arbitrary objects in Lidar is crucial for embodied navigation, with applications ranging from streaming perception to semantic mapping and localization. However, the primary challenge in advancing research and developing generalized, versatile methods for spatio-temporal scene understanding in Lidar lies in the scarcity of datasets that provide the necessary diversity and scale of annotations. To overcome these challenges, we propose SAL-4D (Segment Anything in Lidar--4D), a method that uses multi-modal robotic sensor setups as a bridge to distill recent developments in Video Object Segmentation (VOS), in conjunction with off-the-shelf Vision-Language foundation models, to Lidar. We use VOS models to pseudo-label tracklets in short video sequences, annotate these tracklets with sequence-level CLIP tokens, and lift them to the 4D Lidar space using calibrated multi-modal sensor setups to distill them into our SAL-4D model. Thanks to temporally consistent predictions, we outperform prior art in 3D Zero-Shot Lidar Panoptic Segmentation (LPS) by over $5$ PQ and unlock Zero-Shot 4D-LPS.
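
The lifting step mentioned above (transferring image-space tracklet masks to Lidar points through a calibrated camera-Lidar setup) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `lift_mask_ids_to_points`, its arguments, and the convention that ID 0 means "unlabeled" are all assumptions made for illustration. It handles a single camera and a single scan, and it ignores occlusion handling and any temporal association.

```python
import numpy as np


def lift_mask_ids_to_points(points_lidar, T_cam_from_lidar, K, id_mask):
    """Give each Lidar point the tracklet ID of the pixel it projects to.

    Hypothetical helper, for illustration only.
    points_lidar:     (N, 3) points in the Lidar frame.
    T_cam_from_lidar: (4, 4) rigid transform from Lidar to camera frame.
    K:                (3, 3) pinhole camera intrinsics.
    id_mask:          (H, W) integer image of tracklet IDs, 0 = unlabeled.
    Returns an (N,) integer array of IDs, 0 where no label applies.
    """
    n = points_lidar.shape[0]
    labels = np.zeros(n, dtype=np.int64)

    # Move the points into the camera frame using homogeneous coordinates.
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    # Only points in front of the camera can project into the image.
    in_front = pts_cam[:, 2] > 1e-6
    uvw = (K @ pts_cam[in_front].T).T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(np.int64)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(np.int64)

    # Keep projections that land inside the image bounds.
    h, w = id_mask.shape
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    # Copy the mask ID at each valid pixel back to its Lidar point.
    idx = np.flatnonzero(in_front)[inside]
    labels[idx] = id_mask[v[inside], u[inside]]
    return labels
```

Applying such a function to every scan in a short window, with mask IDs that stay fixed along each video tracklet, would give point labels that share identities across time. That shared identity is the property the pseudo-labels are meant to carry into the 4D Lidar space.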
