Panoptic Vision-Language Feature Fields

Haoran Chen, Kenneth Blomqvist, Francesco Milano, Roland Siegwart

TL;DR

PVLFF addresses open-vocabulary 3D panoptic segmentation by decoupling semantic and instance feature fields within a neural radiance field and distilling vision-language embeddings for open-set semantic understanding. The instance field is learned from 2D instance proposals via contrastive learning, which enables multi-view-consistent instance segmentation without predefined classes. PVLFF achieves scene-level panoptic performance comparable to closed-set baselines and outperforms existing open-vocabulary 3D systems on semantic segmentation across multiple datasets, while providing hierarchical, multi-scale instance representations via clustering. The approach supports flexible, language-guided 3D scene understanding with practical implications for robotics and AR, and the code is publicly available.

Abstract

Recently, methods have been proposed for 3D open-vocabulary semantic segmentation. Such methods are able to segment scenes into arbitrary classes based on text descriptions provided at runtime. In this paper, we propose, to the best of our knowledge, the first algorithm for open-vocabulary panoptic segmentation in 3D scenes. Our algorithm, Panoptic Vision-Language Feature Fields (PVLFF), learns a semantic feature field of the scene by distilling vision-language features from a pretrained 2D model, and jointly fits an instance feature field through contrastive learning using 2D instance segments on input frames. Despite not being trained on the target classes, our method achieves panoptic segmentation performance similar to that of state-of-the-art closed-set 3D systems on the HyperSim, ScanNet and Replica datasets, and additionally outperforms current 3D open-vocabulary systems in terms of semantic segmentation. We ablate the components of our method to demonstrate the effectiveness of our model architecture. Our code will be available at https://github.com/ethz-asl/pvlff.

Paper Structure (14 sections, 7 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Overview of PVLFF. Given 2D posed images, PVLFF optimizes a semantic feature field by distilling vision-language embeddings from an off-the-shelf network $\mathrm{E^{VL}}$ [li2022lseg], and simultaneously trains an instance feature field through contrastive learning based on 2D instance proposals computed by $\mathrm{E^{IS}}$ [kirillov2023segany]. After training with different loss functions ($\mathcal{L}$), PVLFF is able to perform panoptic segmentation under open-vocabulary prompts.
  • Figure 2: Architecture of PVLFF. Given a 3D coordinate $\mathbf{x}$ and a unit direction $\mathbf{d_r}$, PVLFF uses two sets of hybrid hash encoding (HHE) [blomqvist2022baking] to parameterize the 3D volume for panoptic scene understanding. With one HHE, we encode color $c$, density $\sigma$ and semantic feature $\mathcal{F_S}$. With the other HHE, we exclusively learn instance feature $\mathcal{F_I}$. All these scene properties are modeled by lightweight multilayer perceptrons (MLPs).
  • Figure 3: PVLFF Optimization. We optimize the panoptic feature fields by distilling knowledge from off-the-shelf 2D models [li2022lseg, kirillov2023segany]. For semantic feature learning, we supervise rendered semantic features with precomputed pixel-level VL embeddings. For instance feature learning, we precompute instance masks using a 2D instance segmenter [kirillov2023segany]. We then sample pixels across masks to form positive and negative pairs, and render the corresponding instance features. We compute the similarity among pairs and optimize the instance features by contrastive learning. In addition, we estimate the feature center of each instance mask using the instance feature field with exponential moving average (EMA) parameters, and apply an $l_1$ loss between the instance features and the feature centers.
  • Figure 4: PVLFF with Open-Vocabulary language queries. We query PVLFF with $101$ Replica [replica19arxiv] semantic prompts, and visualize the instance and semantic features through PCA. We show the instance segmentation results directly from HDBSCAN [McInnes2017hdbscan], and the semantic segmentation together with its denoised version.
  • Figure 5: Hierarchical instance features of PVLFF. We run HDBSCAN on rendered instance features and visualize the clustering results. In the clustering tree, each colored node represents a predicted instance. Since we compute instance masks using SAM, which produces multiple levels of segmentation, PVLFF over-segments instances by default. However, we can recover different levels of instance predictions through clustering, and we show the multi-level predictions of "sofa" and "ceiling" from the finest to the complete segmentation by visualizing the leaf node, the mid-part and the sub-structure in the clustering tree.
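
The instance-feature objective described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation. The caption does not state the exact form of the contrastive loss, so an InfoNCE-style loss is swapped in here as an assumption. All function names, the temperature and the momentum value are hypothetical, and features are plain Python lists standing in for rendered per-pixel instance features. Here `slow_features` stands for features rendered from a copy of the field whose parameters follow an exponential moving average.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against zero vectors
    return [x / norm for x in v]


def contrastive_loss(features, mask_ids, temperature=0.1):
    """InfoNCE-style loss (assumed form): pixels sampled from the same
    2D mask are positives, pixels from other masks are negatives."""
    feats = [normalize(f) for f in features]
    n = len(feats)
    total, count = 0.0, 0
    for i in range(n):
        # Scaled cosine similarity of anchor i to every other sampled pixel.
        logits = {j: dot(feats[i], feats[j]) / temperature
                  for j in range(n) if j != i}
        positives = [j for j in logits if mask_ids[j] == mask_ids[i]]
        if not positives:
            continue
        log_denom = math.log(sum(math.exp(s) for s in logits.values()))
        for j in positives:
            total += log_denom - logits[j]
            count += 1
    return total / max(count, 1)


def mask_centers(slow_features, mask_ids):
    """Mean feature per mask, computed from the EMA ("slow") field."""
    sums, counts = {}, {}
    for f, m in zip(slow_features, mask_ids):
        acc = sums.setdefault(m, [0.0] * len(f))
        for k, x in enumerate(f):
            acc[k] += x
        counts[m] = counts.get(m, 0) + 1
    return {m: [x / counts[m] for x in acc] for m, acc in sums.items()}


def center_l1_loss(features, slow_features, mask_ids):
    """Mean absolute difference between each feature and its mask center."""
    centers = mask_centers(slow_features, mask_ids)
    total = 0.0
    for f, m in zip(features, mask_ids):
        total += sum(abs(x - c) for x, c in zip(f, centers[m])) / len(f)
    return total / len(features)


def ema_update(slow_params, params, momentum=0.99):
    """Exponential moving average of parameters (momentum is an assumption)."""
    return [momentum * s + (1.0 - momentum) * p
            for s, p in zip(slow_params, params)]
```

In this sketch, features that already agree with the mask assignment yield a small contrastive loss, while features that contradict it yield a large one. The center term pulls each feature toward the mean of its mask as seen by the slowly updated field.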