Towards Learning to Complete Anything in Lidar
Ayca Takmaz, Cristiano Saltori, Neehar Peri, Tim Meinhardt, Riccardo de Lutio, Laura Leal-Taixé, Aljoša Ošep
TL;DR
This paper tackles zero-shot Lidar scene completion by learning shape priors and open-vocabulary semantics from unlabeled, temporally rich multi-modal data. It introduces CAL, a two-part approach: (1) a pseudo-labeling engine that mines 3D shape priors and CLIP-based semantics from video, and (2) a zero-shot, class-agnostic completion model built on a sparse 3D U-Net and a Transformer decoder. Prompted with a vocabulary at test time, the model can perform semantic scene completion (SSC), panoptic scene completion (PSC), or amodal 3D object detection. The key contributions are a scalable pseudo-labeling pipeline built on vision foundation models, CLIP-derived semantic prototypes for open vocabularies, and a voxel-based completion framework that generalizes beyond fixed class vocabularies. Results on SemanticKITTI and SSCBench-KITTI360 show competitive zero-shot PSC/SSC performance and reveal both the promise and the current limitations of open-vocabulary Lidar perception, with significant room for improvement in label coverage and recognition under zero-shot conditions.
Abstract
We propose CAL (Complete Anything in Lidar) for Lidar-based shape completion in the wild. This task is closely related to Lidar-based semantic/panoptic scene completion. However, contemporary methods can only complete and recognize objects from a closed vocabulary labeled in existing Lidar datasets. In contrast, our zero-shot approach leverages the temporal context from multi-modal sensor sequences to mine object shapes and semantic features of observed objects. These are then distilled into a Lidar-only instance-level completion and recognition model. Although we only mine partial shape completions, we find that our distilled model learns to infer full object shapes from multiple such partial observations across the dataset. We show that our model can be prompted on standard benchmarks for Semantic and Panoptic Scene Completion, localize objects as (amodal) 3D bounding boxes, and recognize objects beyond fixed class vocabularies. Our project page is https://research.nvidia.com/labs/dvl/projects/complete-anything-lidar
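To make the idea of prompting with a vocabulary at test time more concrete, the following is a minimal sketch of how per-instance features might be matched against text embeddings of class names. It is not the authors' implementation: the function names (`l2_normalize`, `prompt_instances`), the toy 4-dimensional vectors, and the use of cosine similarity with an argmax are assumptions for illustration, following the usual CLIP-style zero-shot classification recipe. In practice the text embeddings would come from a CLIP text encoder and the instance features from the completion model.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def prompt_instances(instance_feats, text_embeds, class_names):
    """Label each instance with the most similar class prompt.

    instance_feats: (num_instances, dim) array of per-instance features
                    (hypothetical stand-in for the model's output).
    text_embeds:    (num_classes, dim) array of text embeddings
                    (hypothetical stand-in for a CLIP text encoder).
    Returns the predicted names and the full cosine-similarity matrix.
    """
    sims = l2_normalize(instance_feats) @ l2_normalize(text_embeds).T
    best = sims.argmax(axis=1)
    return [class_names[i] for i in best], sims


# Toy vocabulary; swapping this list changes what the model can "name"
# without any retraining, which is the point of test-time prompting.
class_names = ["car", "pedestrian", "traffic cone"]
text_embeds = np.eye(3, 4)  # made-up embeddings, one per class name

# Two made-up instance features.
instance_feats = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.2, 1.5, 0.1],
])

labels, sims = prompt_instances(instance_feats, text_embeds, class_names)
print(labels)
```

Because the class list is only consumed at inference time, the same trained model could be evaluated on a benchmark's fixed label set or queried with arbitrary new class names, under the assumption that its instance features live in the same space as the text embeddings.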
