MOSE: Monocular Semantic Reconstruction Using NeRF-Lifted Noisy Priors
Zhenhua Du, Binbin Xu, Haoyu Zhang, Kai Huo, Shuaifeng Zhi
TL;DR
MOSE addresses the challenge of dense 3D semantic reconstruction from monocular imagery by lifting imperfect 2D priors into a unified NeRF-based implicit representation that jointly models appearance, geometry, and semantics. It introduces two key innovations: Locally-Consistent Fusion, which enforces segment-level semantic coherence using generic 2D segment masks, and Semantically-Weighted Geometric Regularization, which adaptively strengthens surface smoothness in planar semantic regions to improve geometry and, in turn, semantics. Across ScanNet with NYU-40 semantics, MOSE achieves state-of-the-art performance in 3D semantic segmentation, 2D semantic segmentation, and 3D surface reconstruction, demonstrating the mutual benefits of geometry and semantics when guided by priors. The approach enables robust indoor scene understanding from monocular cues and opens avenues for advancing AR and robotics applications that rely on accurate 3D semantic maps.
Abstract
Accurately reconstructing dense and semantically annotated 3D meshes from monocular images remains a challenging task due to the lack of geometry guidance and imperfect view-dependent 2D priors. Though we have witnessed recent advancements in implicit neural scene representations enabling precise 2D rendering simply from multi-view images, there have been few works addressing 3D scene understanding with monocular priors alone. In this paper, we propose MOSE, a neural field semantic reconstruction approach to lift inferred image-level noisy priors to 3D, producing accurate semantics and geometry in both 3D and 2D space. The key motivation for our method is to leverage generic class-agnostic segment masks as guidance to promote local consistency of rendered semantics during training. With the help of semantics, we further apply a smoothness regularization to texture-less regions for better geometric quality, thus achieving mutual benefits of geometry and semantics. Experiments on the ScanNet dataset show that our MOSE outperforms relevant baselines across all metrics on tasks of 3D semantic segmentation, 2D semantic segmentation and 3D surface reconstruction.
