OpenOcc: Open Vocabulary 3D Scene Reconstruction via Occupancy Representation

Haochen Jiang, Yueming Xu, Yihan Zeng, Hang Xu, Wei Zhang, Jianfeng Feng, Li Zhang

TL;DR

OpenOcc addresses the lack of open-world 3D understanding in NeRF-based reconstructions by unifying occupancy-based geometry with an open vocabulary semantic field distilled from 2D language features. It introduces a multi-resolution occupancy grid, a separate semantic field, and a semantic-aware confidence propagation (SCP) mechanism to stabilize zero-shot inferences across viewpoints, along with a suite of losses including $L_{rgb}$, $L_{depth}$, $L_{occ}$, $L_{fs}$, and $L_{sg}$ for robust training. The approach achieves competitive 3D reconstruction quality and improved small-object segmentation on Replica/ScanNet/Matterport3D, while enabling zero-shot 3D understanding for robotic navigation and cross-view predictions. It also demonstrates practical advantages in memory and computation compared to density-based NeRF variants, enabling efficient real-time-like perception for mobile robots.

Abstract

3D reconstruction is widely used for autonomous navigation in mobile robotics. However, prior work provides only basic geometric structure, without the capability of open-world scene understanding, which limits advanced tasks such as human interaction and visual navigation. Moreover, traditional 3D scene understanding approaches rely on expensive labeled 3D datasets to train a supervised model for a single task. Thus, geometric reconstruction with zero-shot scene understanding, i.e., open vocabulary 3D understanding and reconstruction, is crucial for the future development of mobile robots. In this paper, we propose OpenOcc, a novel framework that unifies 3D scene reconstruction and open vocabulary understanding with neural radiance fields. We model the geometric structure of the scene with an occupancy representation and distill a pre-trained open vocabulary model into a 3D language field via volume rendering for zero-shot inference. Furthermore, we propose a novel semantic-aware confidence propagation (SCP) method to relieve the language field representation degeneracy caused by inconsistent measurements in the distilled features. Experimental results show that our approach achieves competitive performance in 3D scene understanding tasks, especially for small and long-tail objects.
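
The abstract's idea of distilling 2D language features into a 3D field via volume rendering can be illustrated with a minimal sketch. It assumes the commonly used occupancy-based compositing weights $w_i = o_i \prod_{j<i}(1 - o_j)$, which this summary does not spell out. The function names `render_features` and `cosine_distillation_loss` are hypothetical, and the cosine form is only a stand-in: the paper's actual $L_{sg}$ may be defined differently.

```python
import numpy as np

def render_features(occupancy, features):
    """Composite per-sample features along one ray (assumed formulation).

    occupancy: (N,) occupancy probabilities in [0, 1], ordered near to far.
    features:  (N, D) feature vectors sampled from the 3D language field.
    Returns the (D,) rendered feature and the (N,) compositing weights.
    """
    occupancy = np.asarray(occupancy, dtype=np.float64)
    features = np.asarray(features, dtype=np.float64)
    # Probability that the ray reaches sample i without being blocked earlier.
    free_before = np.concatenate(([1.0], np.cumprod(1.0 - occupancy)[:-1]))
    weights = occupancy * free_before
    return weights @ features, weights

def cosine_distillation_loss(rendered, target, eps=1e-8):
    """Hypothetical stand-in loss: 1 - cosine similarity between the rendered
    3D feature and the 2D feature of the corresponding pixel."""
    denom = np.linalg.norm(rendered) * np.linalg.norm(target) + eps
    return 1.0 - float(rendered @ target) / denom

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    occ = np.array([0.05, 0.10, 0.90, 0.95])   # surface near the third sample
    feats = rng.normal(size=(4, 8))            # toy 8-dimensional features
    pixel_feature = feats[2]                   # toy 2D target feature
    rendered, weights = render_features(occ, feats)
    print("weights:", np.round(weights, 3))
    print("loss:", round(cosine_distillation_loss(rendered, pixel_feature), 3))
```

Because the weights concentrate on the first strongly occupied sample, the gradient of such a loss would mostly update the language features near the visible surface.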

Paper Structure (28 sections, 20 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Open-vocabulary 3D scene understanding and reconstruction. We propose OpenOcc, a zero-shot method for 3D scene structure perception. The examples above show zero-shot 3D scene semantic segmentation results with the occupancy feature grid. Blue marks the regions matched to a user-specified query string, demonstrating the flexibility of the language-based feature grid.
  • Figure 2: The overall framework of the proposed method. Left: Given a series of posed RGB-D frames, we construct the RGB, geometry, and semantic decoders on separate multi-resolution feature grids, trained with the geometric losses $\mathcal{L}_{occ}, \mathcal{L}_{fs}, \mathcal{L}_{depth}$ and the color loss $\mathcal{L}_{color}$. To learn the language knowledge, we distill the dense feature $\mathcal{F}_{2d}$ via volume rendering with a distillation loss $\mathcal{L}_{sg}$. Right: During inference, we compute the similarity score between the user's text embeddings and the learned language features, and generate an occupancy feature map to perform the open-vocabulary 3D understanding task.
  • Figure 3: Different semantic feature update strategies. Because the language embeddings can be noisy, the open vocabulary segmentation results for different views within the same training batch are inconsistent, as shown in the left image. (a) Semantic field updates that use the same weight. (b) The proposed Semantic-aware Confidence Propagation (SCP). Dashed lines denote smaller weights. The boundary color of the center point indicates which consistent semantic class dominates the fused feature.
  • Figure 4: Qualitative comparisons. Images of 3D semantic segmentation results on three public indoor benchmarks.
  • Figure 5: 2D segmentation results on ScanNet. We visualize 2D segmentation examples from the ScanNet validation set: (a) the input image, (b) the OpenSeg result, (c) our method, and (d) the ground-truth segmentation. Black pixels in the ground-truth segmentation correspond to classes not included in the ScanNet-20 evaluation classes.
  • ...and 2 more figures
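
The inference step described in the Figure 2 caption, scoring learned language features against the user's text embeddings, can be sketched as follows. This is an illustrative sketch only: it assumes cosine similarity followed by an argmax over the query classes, and the names `l2_normalize` and `query_labels` are hypothetical. The random arrays stand in for features that would come from the trained semantic field and from the text encoder of the pre-trained open vocabulary model.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def query_labels(voxel_features, text_embeddings):
    """Assign each voxel the query class with the highest cosine similarity.

    voxel_features:  (V, D) language features, one row per occupied voxel.
    text_embeddings: (C, D) embeddings, one row per query string.
    Returns (V,) integer labels and the (V, C) similarity scores.
    """
    scores = l2_normalize(voxel_features) @ l2_normalize(text_embeddings).T
    return scores.argmax(axis=1), scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(3, 16))            # stand-in for 3 query embeddings
    # Toy voxels: noisy copies of the query embeddings, so labels are recoverable.
    true_labels = rng.integers(0, 3, size=10)
    voxels = text[true_labels] + 0.05 * rng.normal(size=(10, 16))
    labels, scores = query_labels(voxels, text)
    print("predicted:", labels)
    print("expected: ", true_labels)
```

For a single user-specified query string, as in Figure 1, the same scores could instead be thresholded to highlight the matching voxels; a suitable threshold would have to be chosen empirically.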