3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Zihao Xiao; Longlong Jing; Shangxuan Wu; Alex Zihao Zhu; Jingwei Ji; Chiyu Max Jiang; Wei-Chih Hung; Thomas Funkhouser; Weicheng Kuo; Anelia Angelova; Yin Zhou; Shiwei Sheng

TL;DR

This work tackles 3D open-vocabulary panoptic segmentation for autonomous driving by fusing learnable LiDAR features with dense frozen vision-language model (CLIP) features and employing two distillation losses to bridge 3D space with CLIP embeddings. It introduces a unified segmentation head that predicts class embeddings in the CLIP space and uses cosine similarity to CLIP text embeddings for open-vocabulary classification, accompanied by object-level ($L_O$) and voxel-level ($L_V$) distillation losses. The final objective combines standard panoptic losses with these distillation terms: $L = w_\alpha L_{cls} + w_\beta L_{mask} + w_\lambda L_O + w_\gamma L_V$. Experiments on nuScenes and SemanticKITTI demonstrate large improvements over a strong FC-CLIP baseline and existing open-vocabulary methods, validating the effectiveness of multimodal fusion and distillation for open-set 3D panoptic perception in autonomous driving.
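To make the classification rule and the combined objective concrete, here is a minimal sketch in plain Python. It assumes toy 3-dimensional vectors in place of real CLIP embeddings; the function names, the category list, and the default weights of 1.0 are illustrative assumptions, not values from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_open_vocab(class_embedding, text_embeddings):
    """Pick the category whose text embedding is closest to the prediction.

    text_embeddings maps a category name to its (toy) text embedding, so
    new categories can be added at inference time without retraining.
    """
    scores = {
        name: cosine_similarity(class_embedding, embedding)
        for name, embedding in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


def total_loss(l_cls, l_mask, l_obj, l_vox,
               w_alpha=1.0, w_beta=1.0, w_lambda=1.0, w_gamma=1.0):
    """Weighted sum L = w_a*L_cls + w_b*L_mask + w_l*L_O + w_g*L_V.

    The default weights are placeholders, not the paper's settings.
    """
    return w_alpha * l_cls + w_beta * l_mask + w_lambda * l_obj + w_gamma * l_vox


# Toy stand-ins for CLIP text embeddings (hypothetical values).
text_embeddings = {
    "car": [1.0, 0.0, 0.0],
    "bus": [0.0, 1.0, 0.0],
    "vegetation": [0.0, 0.0, 1.0],
}
predicted_embedding = [0.1, 0.9, 0.2]
label, scores = classify_open_vocab(predicted_embedding, text_embeddings)
```

Because the category set is only a dictionary of text embeddings, the same head can score base and novel classes alike; in this toy example `label` comes out as `"bus"`.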

Abstract

3D panoptic segmentation is a challenging perception task, especially in autonomous driving. It aims to predict both semantic and instance annotations for 3D points in a scene. Although prior 3D panoptic segmentation approaches have achieved great performance on closed-set benchmarks, generalizing these approaches to unseen things and unseen stuff categories remains an open problem. For unseen object categories, 2D open-vocabulary segmentation has achieved promising results by relying solely on frozen CLIP backbones and ensembling multiple classification outputs. However, we find that simply extending these 2D models to 3D does not guarantee good performance due to poor per-mask classification quality, especially for novel stuff categories. In this paper, we propose the first method to tackle 3D open-vocabulary panoptic segmentation. Our model takes advantage of the fusion between learnable LiDAR features and dense frozen CLIP vision features, using a single classification head to make predictions for both base and novel classes. To further improve the classification performance on novel classes and leverage the CLIP model, we propose two novel loss functions: an object-level distillation loss and a voxel-level distillation loss. Our experiments on the nuScenes and SemanticKITTI datasets show that our method outperforms the strong baseline by a large margin.
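The abstract names the two distillation losses but does not give their functional form. The sketch below shows one plausible shape, assuming each loss is a mean cosine distance between a learned 3D embedding (student) and the matching frozen CLIP feature (teacher). The cosine-distance choice, the function names, and the toy 2-dimensional vectors are all assumptions made for illustration and should not be read as the authors' formulation.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def distillation_loss(student, teacher, valid=None):
    """Mean cosine distance over the pairs flagged as valid.

    student: embeddings predicted by the 3D model (assumed learnable).
    teacher: frozen CLIP features aligned to the same objects or voxels.
    valid:   optional flags; pairs with a False flag are skipped, e.g.
             voxels that no camera observes and so have no CLIP feature.
    """
    if valid is None:
        valid = [True] * len(student)
    terms = [
        cosine_distance(s, t)
        for s, t, keep in zip(student, teacher, valid)
        if keep
    ]
    return sum(terms) / len(terms) if terms else 0.0


# Object level (assumed): one embedding per predicted mask versus a CLIP
# embedding for the same object.
query_embeddings = [[1.0, 0.0], [0.0, 2.0]]
clip_object_embeddings = [[2.0, 0.0], [1.0, 0.0]]
loss_object = distillation_loss(query_embeddings, clip_object_embeddings)

# Voxel level (assumed): one embedding per voxel versus the CLIP vision
# feature projected onto that voxel; the last voxel is outside every image.
voxel_embeddings = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
clip_voxel_features = [[1.0, 1.0], [-1.0, 0.0], [0.0, 0.0]]
in_camera_view = [True, True, False]
loss_voxel = distillation_loss(voxel_embeddings, clip_voxel_features, in_camera_view)
```

Both losses need only the frozen CLIP features as targets rather than class labels, which is why terms of this kind can supervise regions belonging to categories absent from the training annotations.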

Paper Structure (19 sections, 13 equations, 10 figures, 9 tables)

Introduction
Related Work
Method
Problem Definition
3D Open-Vocabulary Panoptic Segmentation
Loss Function
Implementation Details
Experiments
Experimental Setting
P3Former-FC-CLIP Baseline
Main Results
Ablation Studies and Analysis
Conclusion
PFC Baseline
Query Assignment
...and 4 more sections

Figures (10)

Figure 1: An illustration of 3D open-vocabulary panoptic segmentation results from our model. Without training on the categories of bus, trash can or vegetation, our method can produce accurate panoptic segmentation results even when the points are sparse.
Figure 2: Overview of our method. Given a LiDAR point cloud and the corresponding camera images, LiDAR features are extracted with a learnable LiDAR encoder, while vision features are extracted by a frozen CLIP vision model. The extracted LiDAR features and the frozen CLIP vision features are then fused and fed to a query-based transformer model to predict instance masks and semantic classes.
Figure 3: (a) the proposed object-level distillation loss, and (b) the proposed voxel-level distillation loss.
Figure 4: Open-vocabulary panoptic segmentation results from PFC and our method on nuScenes. PFC predicts inaccurate categories and masks for the novel pedestrian (red), bus (yellow) and vegetation (green), while ours makes correct predictions.
Figure 5: Open-vocabulary exploration on nuScenes. We show the novel material/object in blue. The orientation of the ego vehicle is fixed in the LiDAR point visualization, while the reference images come from one of the surrounding cameras of the ego vehicle.
...and 5 more figures
