MonoSDF: Exploring Monocular Geometric Cues for Neural Implicit Surface Reconstruction

Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, Andreas Geiger

TL;DR

Neural implicit surface reconstruction from multi-view RGB images degrades in large or textureless scenes and under sparse viewpoints, because the RGB reconstruction loss alone does not constrain the geometry sufficiently. MonoSDF integrates depth and normal cues, predicted by a pretrained general-purpose monocular model, into the optimization of SDF-based scene representations, and systematically compares four architectural choices (dense SDF grids, a single MLP, single-resolution feature grids, and multi-resolution feature grids). Across object-level and scene-level datasets (DTU, Replica, ScanNet, Tanks & Temples), monocular priors consistently improve reconstruction quality and accelerate convergence. Multi-resolution grids offer fast optimization and capture fine detail, while MLPs provide a strong global prior and robustness to noise. The work demonstrates that monocular priors are a practical, scalable means of extending neural implicit surface methods to larger and more complex environments.

Abstract

In recent years, neural implicit surface reconstruction methods have become popular for multi-view 3D reconstruction. In contrast to traditional multi-view stereo methods, these approaches tend to produce smoother and more complete reconstructions due to the inductive smoothness bias of neural networks. State-of-the-art neural implicit methods allow for high-quality reconstructions of simple scenes from many input views. Yet, their performance drops significantly for larger and more complex scenes and scenes captured from sparse viewpoints. This is caused primarily by the inherent ambiguity in the RGB reconstruction loss that does not provide enough constraints, in particular in less-observed and textureless areas. Motivated by recent advances in the area of monocular geometry prediction, we systematically explore the utility these cues provide for improving neural implicit surface reconstruction. We demonstrate that depth and normal cues, predicted by general-purpose monocular estimators, significantly improve reconstruction quality and optimization time. Further, we analyse and investigate multiple design choices for representing neural implicit surfaces, ranging from monolithic MLP models over single-grid to multi-resolution grid representations. We observe that geometric monocular priors improve performance both for small-scale single-object as well as large-scale multi-object scenes, independent of the choice of representation.

Paper Structure (30 sections, 21 equations, 21 figures, 13 tables)

Figures (21)

  • Figure 1: MonoSDF. Top: State-of-the-art neural implicit surface reconstruction methods fail in the presence of limited input views or when applied to complex multi-object scenes. Bottom: We demonstrate that incorporating geometric cues from general-purpose monocular predictors enables scaling to larger scenes while yielding more accurate reconstructions and speeding up optimization. An image resolution of $384\times384$ pixels was used for all results shown above.
  • Figure 2: Overview. In this work we use monocular geometric cues predicted by a general-purpose pretrained network to guide the optimization of neural implicit surface models. More specifically, for a batch of rays, we volume render predicted RGB colors, depth, and normals, and optimize with respect to the input RGB images and monocular geometric cues. Further, we investigate different design choices for neural implicit architectures and provide an in-depth analysis. For clarity, we only show the SDF and not the color prediction branch above.
  • Figure 3: Architectural Ablation Study. Comparing different design choices for neural implicit surface representations, we observe that a dense SDF grid leads to noisy reconstructions due to a missing smoothness bias. The MLP and the Single-Res. Fea. Grid improve results, but geometry tends to be overly smooth with missing details. The best results are obtained using Multi-Res. Fea. Grids.
  • Figure 4: Ablation of Monocular Geometric Cues. Monocular geometric cues significantly improve reconstruction quality for both architectures (we show our MLP variant). With monocular depth cues, the recovered geometry contains more details and a better overall structure. With normal cues, missing details are added and the results become smoother. Using both cues leads to the best performance.
  • Figure 5: Architectures. We show an overview of the four different scene representations considered in this paper.
  • ...and 16 more figures
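
The Figure 2 caption describes volume rendering RGB, depth, and normals for a batch of rays and optimizing them against the input images and the monocular cues. The following is a minimal NumPy sketch of that idea, not the authors' implementation. Several parts are assumptions that this page does not state: the standard alpha-compositing weights, the least-squares scale-and-shift alignment applied to the rendered depth (monocular depth is generally only defined up to scale and shift), the L1-plus-angular normal term, and every function and variable name (`render_weights`, `render_batch`, `align_scale_shift`, `cue_losses`), all of which are hypothetical.

```python
import numpy as np

def render_weights(density, deltas):
    """Alpha-compositing weights for R rays with S samples each (shape R x S)."""
    alpha = 1.0 - np.exp(-density * deltas)
    # Transmittance: fraction of light reaching sample i (exclusive cumulative product).
    ones = np.ones((alpha.shape[0], 1))
    trans = np.concatenate([ones, np.cumprod(1.0 - alpha, axis=1)[:, :-1]], axis=1)
    return trans * alpha

def render_batch(density, deltas, t_vals, colors, normals):
    """Render per-ray RGB (R x 3), depth (R,), and normal (R x 3) with shared weights."""
    w = render_weights(density, deltas)
    rgb = (w[..., None] * colors).sum(axis=1)
    depth = (w * t_vals).sum(axis=1)
    normal = (w[..., None] * normals).sum(axis=1)
    return rgb, depth, normal

def align_scale_shift(rendered_depth, mono_depth):
    """Assumed alignment: least-squares s, b with s * rendered + b ~ mono."""
    A = np.stack([rendered_depth, np.ones_like(rendered_depth)], axis=1)
    sol, *_ = np.linalg.lstsq(A, mono_depth, rcond=None)
    return sol[0], sol[1]

def cue_losses(rgb, depth, normal, rgb_gt, depth_mono, normal_mono):
    """RGB, depth, and normal terms; the exact forms here are assumptions."""
    l_rgb = np.abs(rgb - rgb_gt).mean()
    s, b = align_scale_shift(depth, depth_mono)
    l_depth = ((s * depth + b - depth_mono) ** 2).mean()
    # Normalize rendered normals before comparing with the unit-length cue.
    n = normal / (np.linalg.norm(normal, axis=-1, keepdims=True) + 1e-8)
    l1 = np.abs(n - normal_mono).sum(axis=-1).mean()
    angular = (1.0 - (n * normal_mono).sum(axis=-1)).mean()
    return l_rgb, l_depth, l1 + angular
```

In a real optimization loop these terms would be combined with weighting factors and differentiated with respect to the scene representation in an autodiff framework; the NumPy version above only illustrates the forward computation.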