Table of Contents
Fetching ...

GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields

Yunsong Wang, Hanlin Chen, Gim Hee Lee

TL;DR

GOV-NeSF tackles open vocabulary 3D scene understanding without relying on depth priors or per scene optimization by introducing a generalizable implicit representation. It combines a geometry aware 3D cost volume with a Multi view Joint Fusion module and Cross View Attention to blend multi view color and open vocabulary features from Vision Language Models, trained purely on 2D data. The method yields state of the art performance in both 2D and 3D open vocabulary semantic segmentation across unseen scenes and datasets, and supports novel view synthesis without explicit depth or labeled data. Overall, GOV-NeSF demonstrates strong generalization and practical potential for cross scene open world understanding in robotics and vision systems.

Abstract

Recent advancements in vision-language foundation models have significantly enhanced open-vocabulary 3D scene understanding. However, the generalizability of existing methods is constrained due to their framework designs and their reliance on 3D data. We address this limitation by introducing Generalizable Open-Vocabulary Neural Semantic Fields (GOV-NeSF), a novel approach offering a generalizable implicit representation of 3D scenes with open-vocabulary semantics. We aggregate the geometry-aware features using a cost volume, and propose a Multi-view Joint Fusion module to aggregate multi-view features through a cross-view attention mechanism, which effectively predicts view-specific blending weights for both colors and open-vocabulary features. Remarkably, our GOV-NeSF exhibits state-of-the-art performance in both 2D and 3D open-vocabulary semantic segmentation, eliminating the need for ground truth semantic labels or depth priors, and effectively generalize across scenes and datasets without fine-tuning.

GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields

TL;DR

GOV-NeSF tackles open vocabulary 3D scene understanding without relying on depth priors or per scene optimization by introducing a generalizable implicit representation. It combines a geometry aware 3D cost volume with a Multi view Joint Fusion module and Cross View Attention to blend multi view color and open vocabulary features from Vision Language Models, trained purely on 2D data. The method yields state of the art performance in both 2D and 3D open vocabulary semantic segmentation across unseen scenes and datasets, and supports novel view synthesis without explicit depth or labeled data. Overall, GOV-NeSF demonstrates strong generalization and practical potential for cross scene open world understanding in robotics and vision systems.

Abstract

Recent advancements in vision-language foundation models have significantly enhanced open-vocabulary 3D scene understanding. However, the generalizability of existing methods is constrained due to their framework designs and their reliance on 3D data. We address this limitation by introducing Generalizable Open-Vocabulary Neural Semantic Fields (GOV-NeSF), a novel approach offering a generalizable implicit representation of 3D scenes with open-vocabulary semantics. We aggregate the geometry-aware features using a cost volume, and propose a Multi-view Joint Fusion module to aggregate multi-view features through a cross-view attention mechanism, which effectively predicts view-specific blending weights for both colors and open-vocabulary features. Remarkably, our GOV-NeSF exhibits state-of-the-art performance in both 2D and 3D open-vocabulary semantic segmentation, eliminating the need for ground truth semantic labels or depth priors, and effectively generalize across scenes and datasets without fine-tuning.
Paper Structure (15 sections, 12 equations, 5 figures, 4 tables)

This paper contains 15 sections, 12 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overall pipeline of GOV-NeSF. Given the posed images from any unseen 3D scene, and arbitrary open-vocabulary text queries, our model is capable of both open-vocabulary 3D semantic segmentation and novel view synthesis with 2D semantic segmentation.
  • Figure 2: Structure of GOV-NeSF. Given a set of posed images of the 3D scene, we first use a shared image encoder to extract the 2D feature maps, and unproject them to build a 3D cost volume. Moreover, we leverage LSeg lseg to predict the per-pixel open-vocabulary features. We then perform Multi-View Stereo to query the 2D and open-vocabulary features for each sampled 3D point along the ray, concatenate the queried 2D features and the volume feature, and feed them into the FusionNet to predict blending weights. The final color and open-vocabulary feature are the weighted sum of multi views using the blending weights.
  • Figure 3: FusionNet Structure. We aggregate multi-view features through Cross-View Attention module, and predict view-specific blending weights. Refer to the text for more details.
  • Figure 4: Visualization of 2D results. We show the GT color images, our rendered color images, S-Ray semanticray-OV rendered semantics, our rendered semantics, LSeg$_{gt}$lseg predictions, and GT semantics on novel views from unseen scenes in ScanNet dai2017scannet and Replica replica.
  • Figure 5: Visualization of 3D results. We compare with OpenScene-2D openscene in terms of 3D semantic segmentation on ScanNet.