MOSE: Monocular Semantic Reconstruction Using NeRF-Lifted Noisy Priors

Zhenhua Du, Binbin Xu, Haoyu Zhang, Kai Huo, Shuaifeng Zhi

TL;DR

MOSE addresses dense 3D semantic reconstruction from monocular imagery by lifting imperfect 2D priors into a unified NeRF-based implicit representation that jointly models appearance, geometry, and semantics. It introduces two key components: Locally-Consistent Fusion, which enforces segment-level semantic coherence using generic 2D segment masks, and Semantically-Weighted Geometric Regularization, which adaptively strengthens surface smoothness in planar semantic regions to improve geometry and, in turn, semantics. On ScanNet with NYU-40 semantics, MOSE outperforms the compared baselines in 3D semantic segmentation, 2D semantic segmentation, and 3D surface reconstruction, demonstrating that geometry and semantics benefit each other when both are guided by 2D priors. The approach enables robust indoor scene understanding from monocular cues and is relevant to AR and robotics applications that rely on accurate 3D semantic maps.

Abstract

Accurately reconstructing dense and semantically annotated 3D meshes from monocular images remains a challenging task due to the lack of geometry guidance and imperfect view-dependent 2D priors. Though we have witnessed recent advancements in implicit neural scene representations enabling precise 2D rendering simply from multi-view images, there have been few works addressing 3D scene understanding with monocular priors alone. In this paper, we propose MOSE, a neural field semantic reconstruction approach to lift inferred image-level noisy priors to 3D, producing accurate semantics and geometry in both 3D and 2D space. The key motivation for our method is to leverage generic class-agnostic segment masks as guidance to promote local consistency of rendered semantics during training. With the help of semantics, we further apply a smoothness regularization to texture-less regions for better geometric quality, thus achieving mutual benefits of geometry and semantics. Experiments on the ScanNet dataset show that our MOSE outperforms relevant baselines across all metrics on tasks of 3D semantic segmentation, 2D semantic segmentation and 3D surface reconstruction.
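
The local-consistency idea in the abstract can be illustrated with a small NumPy sketch. It is not the paper's formulation: it assumes per-pixel class probabilities (`probs`) and a class-agnostic segment-id map (`segments`) are already available, and it simply pools the probabilities inside each segment and assigns the segment's most likely class to every pixel in it. The function name and array layout are hypothetical.

```python
import numpy as np

def segment_consistent_labels(probs, segments):
    """Assign each pixel the dominant class of its segment (illustrative only).

    probs:    (H, W, C) per-pixel class probabilities, e.g. rendered semantics.
    segments: (H, W) integer ids from a class-agnostic segmenter (assumed input).
    Returns an (H, W) array of class indices that is constant within a segment.
    """
    h, w, c = probs.shape
    flat_probs = probs.reshape(-1, c)
    flat_seg = segments.reshape(-1)

    # Map arbitrary segment ids to contiguous indices 0..S-1.
    seg_ids, inverse = np.unique(flat_seg, return_inverse=True)

    # Sum the class probabilities of all pixels that share a segment.
    sums = np.zeros((len(seg_ids), c))
    np.add.at(sums, inverse, flat_probs)
    counts = np.bincount(inverse, minlength=len(seg_ids))[:, None]
    seg_mean = sums / counts

    # Every pixel inherits the most likely class of its segment.
    seg_label = seg_mean.argmax(axis=1)
    return seg_label[inverse].reshape(h, w)
```

In a training setting, such pooled segment-level distributions would more plausibly serve as a soft supervision target than as a hard relabeling; the hard argmax here is chosen only to keep the example short.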

Paper Structure (13 sections, 9 equations, 10 figures, 4 tables)


Figures (10)

  • Figure 1: 3D indoor semantic reconstruction. Taking RGB images and noisy 2D scene priors from monocular networks (upper portion), our method, MOSE, reconstructs a smooth 3D semantic map of the scene and renders the associated 2D results (bottom portion).
  • Figure 2: Overview of MOSE. Using RGB images together with estimated normals, semantic labels, and segment masks, MOSE learns the color field, signed distance function (SDF) field, and semantic field of the scene through an implicit neural representation. To address the discontinuity of 2D semantic predictions, we propose a locally-consistent fusion strategy (Sec. 3.2) leveraging 2D segmentation techniques. Semantically-weighted geometric regularization (Sec. 3.3) is further introduced to bring benefits to both the SDF field and the semantic field.
  • Figure 3: Overview of locally-consistent fusion strategy. Severe discontinuity and inconsistency of semantics can be observed when directly inputting noisy multi-view labels into a NeRF-based fusion system (upper part). Our LCF strategy utilizes 2D segment priors to enforce consistent and accurate semantic distributions within each segment mask (bottom part).
  • Figure 4: Overview of semantically-weighted geometric regularization. Achieving a balance between planar and object regions is challenging with the widely-used Eikonal loss (Gropp et al., 2020): large loss weights lead to discontinuity in texture regions, while small loss weights result in loss of object details. Our proposed semantically-weighted geometric regularization (SGR) dynamically adjusts the regularization strength across different semantic classes, resulting in more accurate surface reconstruction. Semantics also benefit from the more accurate radiance field.
  • Figure 5: Qualitative comparisons of 3D semantic reconstruction results. Our method produces a smoother 3D semantic map that aligns well with the GT results, while Manhattan-SDF* and NeuRIS* exhibit severe inconsistencies in semantic labels.
  • ...and 5 more figures
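
The Figure 4 caption describes a regularization strength that varies with semantic class. A minimal sketch of that idea follows, under stated assumptions: the standard Eikonal penalty (‖∇f‖ − 1)² is scaled per sample by a weight interpolated between `w_min` and `w_max` according to the predicted probability of a planar class. The interpolation rule, the weight values, and the choice of which classes count as planar are all hypothetical and are not taken from the paper.

```python
import numpy as np

def semantically_weighted_eikonal(sdf_grads, class_probs, planar_classes,
                                  w_min=0.05, w_max=0.5):
    """Eikonal penalty with a per-sample, semantics-dependent weight (illustrative).

    sdf_grads:      (N, 3) SDF gradients at sampled 3D points.
    class_probs:    (N, C) predicted class probabilities at the same points.
    planar_classes: indices of classes treated as planar (an assumption,
                    e.g. wall/floor/ceiling in an indoor label set).
    """
    # Probability mass each sample assigns to planar classes.
    planar_prob = class_probs[:, planar_classes].sum(axis=1)

    # Stronger regularization where the sample is likely planar.
    weights = w_min + (w_max - w_min) * planar_prob

    # Standard Eikonal term: gradient norm should equal 1 for a valid SDF.
    eikonal = (np.linalg.norm(sdf_grads, axis=1) - 1.0) ** 2
    return float(np.mean(weights * eikonal))
```

With this weighting, samples likely to lie on planar classes are pushed harder toward a unit-norm gradient, while samples on other classes receive only the smaller base weight.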