OV-NeRF: Open-vocabulary Neural Radiance Fields with Vision and Language Foundation Models for 3D Semantic Understanding

Guibiao Liao, Kaichen Zhou, Zhenyu Bao, Kanglin Liu, Qing Li

TL;DR

OV-NeRF exploits pre-trained vision and language foundation models to enhance semantic field learning through single-view and cross-view strategies: Region Semantic Ranking (RSR) regularization rectifies the noisy semantics of each training view, and a Cross-view Self-enhancement (CSE) strategy addresses view-inconsistent semantics.

Abstract

The development of Neural Radiance Fields (NeRFs) has provided a potent representation for encapsulating the geometric and appearance characteristics of 3D scenes. Enhancing the capabilities of NeRFs in open-vocabulary 3D semantic perception tasks has been a recent focus. However, current methods that extract semantics directly from Contrastive Language-Image Pretraining (CLIP) for semantic field learning encounter difficulties due to the noisy and view-inconsistent semantics provided by CLIP. To tackle these limitations, we propose OV-NeRF, which exploits the potential of pre-trained vision and language foundation models to enhance semantic field learning through single-view and cross-view strategies. First, from the single-view perspective, we introduce Region Semantic Ranking (RSR) regularization, which leverages 2D mask proposals derived from Segment Anything (SAM) to rectify the noisy semantics of each training view, facilitating accurate semantic field learning. Second, from the cross-view perspective, we propose a Cross-view Self-enhancement (CSE) strategy to address the challenge raised by view-inconsistent semantics. Rather than always relying on the inconsistent 2D semantics from CLIP, CSE trains the semantic field with the 3D-consistent semantics generated by the well-trained semantic field itself, aiming to reduce ambiguity and enhance overall semantic consistency across different views. Extensive experiments validate that OV-NeRF outperforms current state-of-the-art methods, achieving significant improvements of 20.31% and 18.42% in mIoU on Replica and ScanNet, respectively. Furthermore, our approach exhibits consistently superior results across various CLIP configurations, further verifying its robustness. Project page: https://github.com/pcl3dv/OV-NeRF.
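The single-view idea in the abstract, using SAM mask proposals to clean up a noisy per-pixel relevancy map, can be illustrated with a small sketch. The code below is not the paper's RSR formulation, which is not spelled out here. It substitutes plain mask-wise averaging as a simplified stand-in, and the function name `region_rectify` and the array layout are assumptions made for illustration only.

```python
import numpy as np

def region_rectify(relevancy, masks):
    """Smooth a noisy relevancy map inside each region proposal.

    relevancy: (H, W, C) array of per-pixel class scores (e.g. from CLIP).
    masks:     list of (H, W) boolean arrays (e.g. SAM region proposals).

    Simplified stand-in for region-level regularization: every pixel in a
    mask takes the mean score vector of that mask. This is NOT the ranking
    scheme used in the paper, only an illustration of the general idea.
    """
    out = relevancy.copy()
    for mask in masks:
        if mask.any():
            # Average the original scores over the region, then broadcast
            # the mean back to every pixel covered by the mask.
            out[mask] = relevancy[mask].mean(axis=0)
    return out

# Toy 2x2 image with 2 classes; the top row forms one region.
rel = np.array([[[0.9, 0.1], [0.3, 0.7]],
                [[0.2, 0.8], [0.4, 0.6]]])
top_row = np.array([[True, True], [False, False]])
rectified = region_rectify(rel, [top_row])
```

In the toy example, the two top-row pixels disagree on the winning class before rectification and agree afterwards, which is the kind of within-region consistency that mask proposals are meant to provide.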

Paper Structure (16 sections, 12 equations, 9 figures, 8 tables, 1 algorithm)

This paper contains 16 sections, 12 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: Visualization of relevancy maps with the text query "Table". As shown in (a), the two relevancy maps of different views derived from the original CLIP are coarse and view-inconsistent. However, 3DOVS directly leverages these relevancy maps for semantic field learning, leading to inferior rendering results, as exhibited in (c). In contrast, the proposed Region Semantic Ranking regularization significantly improves the quality of relevancy maps, as illustrated in (b), yielding the precise result in (d).
  • Figure 2: Overview of OV-NeRF Optimization: Left: OV-NeRF feature rendering process. OV-NeRF represents a semantic field of 3D volumes using a trainable MLP network (Section \ref{Overview_framework}). Right: Optimization of the trainable OV-NeRF by utilizing vision and language foundation models. Initially, multi-view training images undergo processing in the CLIP visual encoder to extract image features. Concurrently, text embeddings are obtained through the CLIP text encoder. To optimize OV-NeRF, we propose two key strategies: Region Semantic Ranking (RSR) regularization and Cross-view Self-enhancement (CSE) strategy, along with the incorporation of the CLIP feature loss. Specifically, the Segment Anything model is employed to produce region proposals over the corresponding images. Then, utilizing pre-computed CLIP features and SAM's region proposals, our RSR generates the precise relevancy map (blue border) to supervise OV-NeRF (Section \ref{RSR_design}), instead of using the original noisy relevancy map (gray border) derived from the CLIP model. Furthermore, after training OV-NeRF for several epochs, we leverage the rendered pseudo outputs obtained from OV-NeRF, encompassing both training views (green border) and unseen novel views (orange border), for cross-view self-enhancement supervision (Section \ref{CSE_design}).
  • Figure 3: Visualization of relevancy maps. The first and second rows represent different views. (a) $\sim$ (e) are obtained from the training views, while (f) $\sim$ (g) are acquired from the testing views. For more details, refer to Section \ref{CSE_design}.
  • Figure 4: Visual examples of novel semantic pseudo map synthesis across different views. These novel pseudo maps can show the main structure of the scene, providing additional information for semantic field learning.
  • Figure 5: Qualitative results of various NeRF-based 3D open-vocabulary segmentation methods with different initialization, including CLIP model ($2^{nd} \sim 4^{th}$), OpenCLIP model ($6^{th} \sim 8^{th}$), and fine-tuned CLIP model ($10^{th} \sim 11^{th}$). Our method achieves more accurate and view-consistent results in various scenes.
  • ...and 4 more figures
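
The pipeline described in the Figure 2 caption (CLIP image features, text embeddings, a relevancy map, and a later switch to rendered pseudo outputs) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `relevancy_map` and `select_supervision`, the softmax temperature, and the `warmup_epochs` threshold are all hypothetical, and the caption only says the switch happens "after training OV-NeRF for several epochs".

```python
import numpy as np

def relevancy_map(pixel_feats, text_embeds, temperature=0.1):
    """Per-pixel class probabilities from feature/text similarity.

    pixel_feats: (H, W, D) image features (assumed non-zero per pixel).
    text_embeds: (C, D) text embeddings, one per class query.
    temperature: assumed softmax temperature (hypothetical value).
    """
    # Cosine similarity: L2-normalise both sides, then take dot products.
    f = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    logits = (f @ t.T) / temperature              # shape (H, W, C)
    # Numerically stable softmax over the class axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

def select_supervision(epoch, clip_relevancy, rendered_relevancy,
                       warmup_epochs=5):
    """Hypothetical schedule for cross-view self-enhancement.

    Early epochs supervise with 2D relevancy derived from CLIP; later
    epochs supervise with maps rendered by the semantic field itself.
    The threshold value is an assumption for illustration.
    """
    if epoch < warmup_epochs:
        return clip_relevancy
    return rendered_relevancy

# Toy example: a 1x2 image with 2-D features and two class queries.
feats = np.array([[[2.0, 0.0], [0.0, 3.0]]])
texts = np.array([[1.0, 0.0], [0.0, 1.0]])
rel = relevancy_map(feats, texts)
```

The hard switch in `select_supervision` is the simplest possible schedule. A blended or confidence-weighted target would be another reasonable design, but nothing in the text here specifies which is used.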