Segment Any Point Cloud Sequences by Distilling Vision Foundation Models

Youquan Liu; Lingdong Kong; Jun Cen; Runnan Chen; Wenwei Zhang; Liang Pan; Kai Chen; Ziwei Liu

TL;DR

Seal presents a scalable, cross-modal framework that distills semantic cues from vision foundation models (VFMs) into 3D automotive point clouds for self-supervised segmentation. By replacing traditional image-based region proposals with semantically informed superpixels from VFMs and enforcing camera-to-LiDAR and temporal consistencies via specialized losses, Seal achieves strong linear-probing and few-shot performance across 11 diverse datasets, including nuScenes, where it reaches 45.0% mIoU after linear probing. The key contributions are the VFM-assisted spatial contrastive loss $\mathcal{L}^{\text{vfm}}$, the superpoint temporal consistency loss $\mathcal{L}^{\text{tmp}}$, and the point-to-segment regularization $\mathcal{L}^{\text{p2s}}$, which together enable robust cross-modal representation learning and generalization to varied data distributions. The work demonstrates practical impact by reducing annotation needs, improving robustness to sensor misalignment and environmental perturbations, and enabling accurate segmentation of diverse automotive point clouds for downstream perception tasks.
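Since the headline numbers are reported in mIoU, a minimal sketch of that metric may help. The function name `mean_iou` and the choice to skip classes absent from both predictions and labels are assumptions made for illustration; benchmark toolkits may treat empty classes or ignored labels differently.

```python
def mean_iou(predictions, labels, num_classes):
    """Mean intersection-over-union across classes for per-point labels."""
    ious = []
    for c in range(num_classes):
        tp = sum(1 for p, l in zip(predictions, labels) if p == c and l == c)
        fp = sum(1 for p, l in zip(predictions, labels) if p == c and l != c)
        fn = sum(1 for p, l in zip(predictions, labels) if p != c and l == c)
        denom = tp + fp + fn
        if denom == 0:
            # Assumption: a class that never appears is skipped, not scored.
            continue
        ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with four points and two classes.
toy_score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so `toy_score` is their average.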

Abstract

Recent advancements in vision foundation models (VFMs) have opened up new possibilities for versatile and efficient visual perception. In this work, we introduce Seal, a novel framework that harnesses VFMs for segmenting diverse automotive point cloud sequences. Seal exhibits three appealing properties: i) Scalability: VFMs are directly distilled into point clouds, obviating the need for annotations in either 2D or 3D during pretraining. ii) Consistency: Spatial and temporal relationships are enforced at both the camera-to-LiDAR and point-to-segment regularization stages, facilitating cross-modal representation learning. iii) Generalizability: Seal enables knowledge transfer in an off-the-shelf manner to downstream tasks involving diverse point clouds, including those from real/synthetic, low/high-resolution, large/small-scale, and clean/corrupted datasets. Extensive experiments conducted on eleven different point cloud datasets showcase the effectiveness and superiority of Seal. Notably, Seal achieves 45.0% mIoU on nuScenes after linear probing, surpassing random initialization by 36.9% mIoU and outperforming prior art by 6.1% mIoU. Moreover, Seal demonstrates significant performance gains over existing methods across 20 different few-shot fine-tuning tasks on all eleven tested point cloud datasets.
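The camera-to-LiDAR distillation described above can be illustrated with a generic InfoNCE-style contrastive loss between features pooled per superpoint and per superpixel. This is a sketch under stated assumptions, not the paper's exact $\mathcal{L}^{\text{vfm}}$: the helper names, the mean pooling, and the temperature of 0.07 are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def mean_pool(features, segment_ids):
    """Average feature vectors that share a segment id."""
    sums, counts = {}, {}
    for feat, seg in zip(features, segment_ids):
        acc = sums.setdefault(seg, [0.0] * len(feat))
        sums[seg] = [a + x for a, x in zip(acc, feat)]
        counts[seg] = counts.get(seg, 0) + 1
    return {seg: [a / counts[seg] for a in acc] for seg, acc in sums.items()}


def superpixel_contrastive_loss(point_feats, point_seg, pixel_feats, pixel_seg,
                                temperature=0.07):
    """InfoNCE over segments: a superpoint's positive is its own superpixel."""
    q = mean_pool(point_feats, point_seg)
    k = mean_pool(pixel_feats, pixel_seg)
    shared = sorted(set(q) & set(k))
    if not shared:
        raise ValueError("no segment id is shared between points and pixels")
    qn = [l2_normalize(q[s]) for s in shared]
    kn = [l2_normalize(k[s]) for s in shared]
    total = 0.0
    for i, qi in enumerate(qn):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in kn]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_denom - logits[i]
    return total / len(shared)


# Toy features: two segments with orthogonal directions.
pts = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
pix = [[1.0, 0.0], [0.0, 1.0]]
aligned = superpixel_contrastive_loss(pts, [0, 0, 1], pix, [0, 1])
swapped = superpixel_contrastive_loss(pts, [0, 0, 1], pix, [1, 0])
```

When the point and pixel features of matching segments agree, the loss is near zero; swapping the superpixel ids makes every positive pair orthogonal and the loss grows accordingly.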

Paper Structure (26 sections, 4 equations, 16 figures, 15 tables)

Introduction
Related Work
Seal: A Scalable, Consistent, and Generalizable Framework
Preliminaries
Semantic Superpixel Spatial Consistency
Semantic Superpoint Temporal Consistency
Experiments
Settings
Comparative Study
Ablation Study
Concluding Remark
Additional Implementation Detail
Datasets
Vision Foundation Models
Implementation Detail
...and 11 more sections

Figures (16)

Figure 1: The proposed Seal distills semantic awareness on camera views from VFMs to the point cloud via superpixel-driven contrastive learning. [1st row] Semantic superpixels generated by SLIC \cite{achanta2012slic} and recent VFMs \cite{kirillov2023sam, zou2023xcoder, zou2023seem}, where each color represents one segment. [2nd row] Semantic superpoints grouped by projecting superpixels to 3D via camera-LiDAR correspondence. [3rd row] Visualizations of the linear probing results of our framework driven by SLIC and different VFMs.
Figure 2: Overview of the Seal framework. We generate, for each {LiDAR, camera} pair {$\mathcal{P}^{t},\mathcal{I}^{t}$} at timestamp $t$ and another LiDAR frame $\mathcal{P}^{t+n}$ at timestamp $t+n$, the semantic superpixel and superpoint by vision foundation models (VFMs). Two pretraining objectives are then formed, including spatial contrastive learning between paired LiDAR and camera features (Sec. \ref{sec:c2l}) and temporal consistency regularization between point segments at different timestamps (Sec. \ref{sec:seal}).
Figure 3: The positive feature correspondences in the contrastive learning objective of our framework. The circles and triangles represent the instance-level and the point-level features, respectively.
Figure 4: The cosine similarity between a query point (denoted as the red dot) and the features learned with SLIC \cite{achanta2012slic} and different VFMs \cite{kirillov2023sam, zou2023xcoder, zhang2023openSeeD, zou2023seem}. The queried semantic classes, from top to bottom, are: "car", "manmade", and "truck". The color goes from violet to yellow, denoting low and high similarity scores, respectively. Best viewed in color.
Figure 5: The qualitative results of different point cloud pretraining approaches pretrained on the raw data of nuScenes \cite{fong2022panoptic-nuScenes} and fine-tuned with $1\%$ labeled data. To highlight the differences, the correct / incorrect predictions are painted in gray / red, respectively. Best viewed in color.
...and 11 more figures
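
The superpoint grouping described in the Figure 1 caption (projecting superpixels to 3D via camera-LiDAR correspondence) can be sketched as follows. This is a minimal illustration assuming a pinhole camera with intrinsics `(fx, fy, cx, cy)` and a rigid LiDAR-to-camera transform; the function names and the `-1` marker for points outside the image are hypothetical and not taken from the paper.

```python
import math


def project_points_to_superpixels(points, rotation, translation, intrinsics,
                                  superpixel_map):
    """Assign each LiDAR point the id of the superpixel it projects onto."""
    height, width = len(superpixel_map), len(superpixel_map[0])
    fx, fy, cx, cy = intrinsics
    labels = []
    for p in points:
        # LiDAR frame -> camera frame (rigid transform).
        xc, yc, zc = (
            sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
            for r in range(3)
        )
        if zc <= 0:
            labels.append(-1)  # behind the camera
            continue
        # Pinhole projection to integer pixel coordinates.
        u = math.floor(fx * xc / zc + cx)
        v = math.floor(fy * yc / zc + cy)
        inside = 0 <= u < width and 0 <= v < height
        labels.append(superpixel_map[v][u] if inside else -1)
    return labels


def group_superpoints(labels):
    """Collect point indices per superpixel id, skipping unprojected points."""
    groups = {}
    for idx, seg in enumerate(labels):
        if seg >= 0:
            groups.setdefault(seg, []).append(idx)
    return groups


# Toy setup: a 2x4 superpixel map with two segments and an identity extrinsic.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
seg_map = [[0, 0, 1, 1], [0, 0, 1, 1]]
cloud = [(-1.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 0.0, -1.0),
         (5.0, 0.0, 1.0), (-2.0, -1.0, 2.0)]
point_labels = project_points_to_superpixels(
    cloud, identity, [0.0, 0.0, 0.0], (1.0, 1.0, 2.0, 1.0), seg_map)
superpoints = group_superpoints(point_labels)
```

Points that fall behind the camera or outside the image receive no superpixel and are left out of every superpoint, which is one simple way to handle the partial overlap between the camera and LiDAR fields of view.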
