MonoPatchNeRF: Improving Neural Radiance Fields with Patch-based Monocular Guidance

Yuqun Wu, Jae Yong Lee, Chuhang Zou, Shenlong Wang, Derek Hoiem

TL;DR

MonoPatchNeRF tackles geometry accuracy and view extrapolation in large-scale scenes with sparse views by integrating patch-based sampling and monocular cues into a density-based NeRF. It distills patch-level depth and normals, enforces patch-based photometric consistency over virtual views, and restricts density using sparse SfM geometry to prevent floaters. The method achieves state-of-the-art geometry on ETH3D while maintaining competitive novel-view synthesis and offers faster training and inference than comparable regularized NeRFs. These contributions provide a practical balance between geometric accuracy and rendering quality for large-scale, sparsely photographed scenes.

Abstract

The latest regularized Neural Radiance Field (NeRF) approaches produce poor geometry and view extrapolation for large scale sparse view scenes, such as ETH3D. Density-based approaches tend to be under-constrained, while surface-based approaches tend to miss details. In this paper, we take a density-based approach, sampling patches instead of individual rays to better incorporate monocular depth and normal estimates and patch-based photometric consistency constraints between training views and sampled virtual views. Loosely constraining densities based on estimated depth aligned to sparse points further improves geometric accuracy. While maintaining similar view synthesis quality, our approach significantly improves geometric accuracy on the ETH3D benchmark, e.g. increasing the F1@2cm score by 4x-8x compared to other regularized density-based approaches, with much lower training and inference time than other approaches.
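The F1@2cm score quoted above can be illustrated with a simplified point-cloud version: precision is the fraction of reconstructed points within 2 cm of some ground-truth point, recall is the fraction of ground-truth points within 2 cm of some reconstructed point, and F1 is their harmonic mean. The sketch below is not the official ETH3D evaluation tool; the function name `f1_at_threshold`, the brute-force nearest-neighbor search, and the assumption that coordinates are in meters are illustrative choices only.

```python
import numpy as np


def f1_at_threshold(pred, gt, tau=0.02):
    """Simplified F1 score between two point clouds at distance threshold tau.

    pred: (N, 3) reconstructed points, gt: (M, 3) ground-truth points.
    tau is assumed to be in the same unit as the points (0.02 = 2 cm in meters).
    Brute-force O(N*M) distances; only suitable for small clouds.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Pairwise Euclidean distances, shape (N, M).
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    # Precision: reconstructed points that lie close to the ground truth.
    precision = float(np.mean(dists.min(axis=1) <= tau))
    # Recall: ground-truth points that are covered by the reconstruction.
    recall = float(np.mean(dists.min(axis=0) <= tau))
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

For example, with two reconstructed points and two ground-truth points where only one pair lies within 2 cm, precision and recall are both 0.5, so the score is 0.5.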

Paper Structure

This paper contains 18 sections, 6 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: Overview of our architecture. MonoPatchNeRF uses three major types of losses: 1) color supervision from RGB images, 2) geometric supervision from monocular depth and normal maps, and 3) virtual-view patch regularization between randomly sampled patches and the corresponding ground-truth RGB pixels. We sample the virtual view pose via random translations from the training view camera center, and obtain the corresponding virtual-view patch by rendering along rays through the 3D points unprojected from the training view with its rendered depth (Figure 2). Additionally, we limit the density search space by pruning regions using the monocular geometry (Figure 3).
  • Figure 2: Virtual view patch sampling and occlusion visualization. (a) We first sample a virtual center $\textbf{o}_{p^\ast}$ near the training center $\textbf{o}_{p}$. We then unproject the training patch to $\{\textbf{X}_p\}$ with the rendered depth, and project $\{\textbf{X}_p\}$ to $\textbf{o}_{p^\ast}$ to obtain the viewing directions of the virtual patch. The colors of the virtual patch are rendered and compared to the ground-truth RGB of the training patch. (b) We unproject the virtual patch to $\{\textbf{X}_{p^\ast}\}$ with the virtual rendered depth, and mask pixels based on the angle $\{\theta_{p^\ast \rightarrow p}\}$ between the ray from $\{\textbf{X}_p\}$ to $\textbf{o}_p$ and the ray from $\{\textbf{X}_{p^\ast}\}$ to $\textbf{o}_p$. For simplicity, the visualization contains only a single pixel $p$.
  • Figure 3: Visualization of density restrictions. On the left, we present the point cloud reconstruction of our model trained with density restrictions. On the right, a vertical slice of the reconstructed scene is shown, both with and without density restrictions. The original scene points and color points (green and red) represent our reconstructed point cloud with and without density restrictions, respectively. The blue area denotes density-restricted voxels. With density restrictions, the ground is accurately reconstructed as a plane, whereas without density restrictions, the ground sinks down.
  • Figure 4: Qualitative results on ETH3D [schops2017multi]. We visualize the rendered RGB, depth, and normal maps of the test views and the complete geometry reconstruction on the facade of ETH3D [schops2017multi] for our method and the baselines RegNeRF [niemeyer2022regnerf], FreeNeRF [yang2023freenerf], MonoSDF [yu2022monosdf], and Neuralangelo [li2023neuralangelo]. We zoom in on challenging areas such as lamps and stairs to highlight the differences. The depths of patches are re-normalized for visualization purposes. The geometry of MonoSDF and Neuralangelo is a mesh, and the geometry of the other methods is a projected point cloud. Best viewed when zoomed in.
  • Figure 5: Qualitative comparison of novel view images and meshes. We provide rendered test-view images and meshes on the ETH3D dataset [schops2017multi]. The meshes of our method, RegNeRF [niemeyer2022regnerf], and FreeNeRF [yang2023freenerf] are generated via TSDF fusion from the predicted RGB-D sequences. Best viewed when zoomed in.
  • ...and 11 more figures
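
The geometry described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the uniform translation range `max_shift`, and the 5-degree angle threshold are hypothetical, and depth is assumed to be the distance along a unit-length ray.

```python
import numpy as np


def sample_virtual_center(o_p, max_shift=0.1, rng=None):
    """Sample a virtual camera center near the training center o_p.

    Assumption: a uniform random translation within +/- max_shift per axis.
    """
    rng = np.random.default_rng() if rng is None else rng
    return o_p + rng.uniform(-max_shift, max_shift, size=3)


def virtual_patch_rays(o_p, dirs_p, depth_p, o_v):
    """Unproject a training patch and aim virtual rays at the resulting points.

    o_p: (3,) training center, dirs_p: (N, 3) unit ray directions,
    depth_p: (N,) rendered distances along those rays, o_v: (3,) virtual center.
    Returns the 3D points X_p and the unit viewing directions from o_v.
    """
    X_p = o_p + depth_p[:, None] * dirs_p          # unproject with rendered depth
    to_points = X_p - o_v                          # directions toward the same points
    dirs_v = to_points / np.linalg.norm(to_points, axis=-1, keepdims=True)
    return X_p, dirs_v


def occlusion_mask(o_p, X_p, o_v, dirs_v, depth_v, max_angle_deg=5.0):
    """Keep pixels whose virtual-view point agrees with the training-view point.

    The virtual patch is unprojected with its own rendered depth to X_v, and the
    angle at o_p between X_p and X_v is thresholded (threshold is hypothetical).
    """
    X_v = o_v + depth_v[:, None] * dirs_v
    a = X_p - o_p
    b = X_v - o_p
    cos = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    )
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return angle <= max_angle_deg                  # True = pixel is kept
```

When the virtual view's rendered depth lands on the same surface as the training view, the two points coincide and the angle is zero, so the pixel is kept; when something closer blocks the virtual ray, the angle grows and the pixel is masked out of the photometric comparison.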