MonoInstance: Enhancing Monocular Priors via Multi-view Instance Alignment for Neural Rendering and Reconstruction

Wenyuan Zhang, Yixiao Yang, Han Huang, Liang Han, Kanle Shi, Yu-Shen Liu, Zhizhong Han

TL;DR

This work addresses the instability of monocular depth priors in multi-view neural rendering by aligning instance-level depths across views to form a unified 3D representation and estimating point-density-based uncertainty. It introduces MonoInstance, which uses the resulting uncertainty maps to adapt depth supervision, guide ray sampling, and apply an uncertainty-based instance-mask constraint, thereby improving reconstruction and novel-view synthesis across dense and sparse settings. The method demonstrates state-of-the-art performance on ScanNet, Replica, DTU, and LLFF benchmarks, while remaining a plug-in that can be integrated with various multi-view rendering frameworks. The approach offers a practical pathway to more robust monocular priors in real-world scenes, enhancing geometric fidelity and rendering quality.

Abstract

Monocular depth priors have been widely adopted by neural rendering in multi-view tasks such as 3D reconstruction and novel view synthesis. However, because the predictions are inconsistent from view to view, how to leverage monocular cues more effectively in a multi-view context remains a challenge. Current methods treat the entire estimated depth map indiscriminately and use it as ground-truth supervision, ignoring the inherent inaccuracy and cross-view inconsistency of monocular priors. To resolve these issues, we propose MonoInstance, a general approach that explores the uncertainty of monocular depths to provide enhanced geometric priors for neural rendering and reconstruction. Our key insight lies in aligning the depths of each segmented instance from multiple views within a common 3D space, thereby casting the uncertainty estimation of monocular depths into a density measure within noisy point clouds. For high-uncertainty areas where depth priors are unreliable, we further introduce a constraint term that encourages the projected instances to align with the corresponding instance masks on nearby views. MonoInstance is a versatile strategy that can be seamlessly integrated into various multi-view neural rendering frameworks. Our experimental results demonstrate that MonoInstance significantly improves performance in both reconstruction and novel view synthesis on various benchmarks.
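The density-based view of uncertainty described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names `backproject` and `density_uncertainty`, the pinhole intrinsics `K`, the camera-to-world pose convention, the fixed `radius`, and the normalization by the maximum neighbor count are all assumptions made for illustration. The sketch also assumes the per-view instance depths have already been brought to a common scale, and it uses brute-force neighbor counting, which is only practical for small point sets.

```python
import numpy as np


def backproject(depth, K, cam_to_world, mask):
    """Lift the masked pixels of one depth map to world-space 3D points.

    Hypothetical helper: assumes a pinhole camera with intrinsics K (3x3)
    and a 4x4 camera-to-world pose.
    """
    v, u = np.nonzero(mask)                    # pixel rows (v) and columns (u)
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z], axis=1)      # (N, 3) in camera coordinates
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t                   # (N, 3) in world coordinates


def density_uncertainty(points, radius):
    """Score each point in [0, 1]: few neighbors within `radius` means high uncertainty.

    Assumed scoring rule for illustration, not the paper's exact measure.
    """
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)       # brute-force pairwise distances, O(N^2)
    counts = (dist < radius).sum(axis=1) - 1   # neighbors per point, excluding itself
    max_count = counts.max()
    if max_count == 0:                         # no point has any neighbor
        return np.ones(len(points))
    return 1.0 - counts / max_count


# Usage: fuse the points of one instance from several views, then score them.
# fused = np.concatenate([backproject(d, K, pose, m) for d, pose, m in views])
# uncertainty = density_uncertainty(fused, radius=0.05)
```

In this sketch, points that several views agree on pile up near the same surface and receive low scores, while points from inconsistent depth predictions scatter and receive high scores. A spatial index such as a k-d tree would be the natural replacement for the pairwise distance matrix on real scenes.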

Paper Structure

This paper contains 14 sections, 10 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Overview of our method. We take multi-view 3D reconstruction through NeRF-based rendering as an example. (a) Starting from multi-view consistent instance segmentation and estimated monocular depths, we align the same instance from different viewpoints by back-projecting instance depths into a point cloud. The cross-view inconsistency of the monocular cues then becomes a density measure in the neighborhood of each point, yielding uncertainty maps (Sec. \ref{sec:uncertainty}). The estimated uncertainty maps are further used in (b) the neural rendering pipeline to guide the adaptive depth loss and ray sampling (Sec. \ref{sec:optimization}) and in (c) the instance mask constraints (Sec. \ref{sec:silhouette}).
  • Figure 2: Illustration of uncertainty estimation. Areas with inconsistent depths (chair legs) correspond to more dispersed point cloud areas with low density (few points) in a neighborhood, indicating high uncertainty. In contrast, areas with accurate depths (chair seats) correspond to the points that are densely distributed on the true surface, indicating low uncertainty.
  • Figure 3: Visual comparison of the estimated uncertainty maps between DebSDF and ours. Our method estimates sharp uncertainty maps that faithfully capture the fine-grained geometric structures.
  • Figure 4: Visual comparisons of dense-view 3D reconstruction on the ScanNet and Replica datasets.
  • Figure 5: Visual comparisons on the DTU dataset for sparse-input reconstruction with little overlap between views.
  • ...and 3 more figures