NeVStereo: A NeRF-Driven NVS-Stereo Architecture for High-Fidelity 3D Tasks

Pengcheng Chen, Yue Hu, Wenhao Li, Nicole M Gunderson, Andrew Feng, Zhenglong Sun, Peter Beerel, Eric J Seibel

TL;DR

NeVStereo presents a NeRF-driven NVS-stereo framework that jointly recovers camera poses, multi-view depth, novel-view synthesis, and surface geometry from casual RGB multi-view inputs. It introduces a multi-view confidence-guided RGB-D optimization (Mv-CG) and a NeRF-coupled bundle adjustment with iterative supervision to enforce cross-view geometric consistency and reduce surface artifacts. Across indoor, outdoor, tabletop, and aerial data, it achieves state-of-the-art performance in pose estimation, depth accuracy, NVS fidelity, and mesh quality, with strong zero-shot generalization. The approach demonstrates that integrating NeRF-based NVS with robust depth voting, refinement, and TSDF fusion can surpass traditional SfM ceilings while mitigating common NeRF artifacts, albeit with reliance on initialization and challenges under sparse views.

Abstract

In modern dense 3D reconstruction, feed-forward systems (e.g., VGGT, pi3) focus on end-to-end matching and geometry prediction but do not explicitly provide novel view synthesis (NVS). Neural rendering-based approaches offer high-fidelity NVS and detailed geometry from posed images, yet they typically assume fixed camera poses and can be sensitive to pose errors. As a result, it remains non-trivial to obtain a single framework that offers accurate poses, reliable depth, high-quality rendering, and accurate 3D surfaces from casually captured views. We present NeVStereo, a NeRF-driven NVS-stereo architecture that aims to jointly deliver camera poses, multi-view depth, novel view synthesis, and surface reconstruction from multi-view RGB-only inputs. NeVStereo combines NeRF-based NVS for stereo-friendly renderings, confidence-guided multi-view depth estimation, NeRF-coupled bundle adjustment for pose refinement, and an iterative refinement stage that updates both depth and the radiance field to improve geometric consistency. This design mitigates common NeRF-based issues such as surface stacking, artifacts, and pose-depth coupling. Across indoor, outdoor, tabletop, and aerial benchmarks, our experiments indicate that NeVStereo achieves consistently strong zero-shot performance, with up to 36% lower depth error, 10.4% improved pose accuracy, 4.5% higher NVS fidelity, and state-of-the-art mesh quality (F1 91.93%, Chamfer 4.35 mm) compared to existing leading methods.

Paper Structure (11 sections, 18 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: 3D reconstruction quality comparison. NeVStereo produces point clouds with more accurate geometry and significantly fewer floating artifacts than other methods.
  • Figure 2: Artifact comparison under degraded SfM initialization. While input-view renderings look similar, the novel views show that 3DGS produces sharper, more structured artifacts than NeRF. These artifacts disrupt stereo correspondence and explain the larger accuracy drop of 3DGS-based NVS-stereo.
  • Figure 3: Architecture overview. Multi-view RGB inputs yield an initial SfM reconstruction and a coarse NeRF, which renders stereo pairs for depth estimation. The resulting depths are refined using our modified DROID-SLAM with multi-view depth voting and a NeRF-guided reprojection loss, followed by TSDF fusion and depth completion. The optimized depths then supervise a second-round NeRF refinement using our depth-guided Gaussian ray sampling for improved geometric accuracy. ★ Green highlights denote our outputs.
  • Figure 4: Stereo-depth projection without explicit multi-view constraints produces non-coincident, layered surfaces (surface stacking). Even with accurate per-pixel depths, NeRF’s geometry can drift or ghost across views, breaking cross-view consistency. Pose-optimized NeRF variants (BARF or CamP) do not solve this.
  • Figure 5: Pose optimization effectively eliminates the erroneous surface stacking shown in Fig. 4. By refining camera poses under our proposed mechanism, the projected depths from different views converge onto a coherent surface, yielding clean and well-aligned geometry across all viewpoints.
  • ...and 3 more figures
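
The Figure 3 caption mentions depth-guided Gaussian ray sampling in the second-round NeRF refinement. The summary above does not give its formulation, so the following is only a minimal sketch of the general idea, assuming that sample depths along a ray are drawn from a Gaussian centered on an optimized depth prior and clipped to the near/far bounds. The function name `depth_guided_samples` and all of its parameters are hypothetical and are not taken from the paper.

```python
import random


def depth_guided_samples(depth_prior, sigma, n_samples, near, far, seed=None):
    """Draw sorted sample depths along one ray, concentrated near a depth prior.

    Illustrative assumption (not the paper's formulation): samples follow
    N(depth_prior, sigma^2) and are clipped to the ray bounds [near, far].
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        t = rng.gauss(depth_prior, sigma)       # sample around the prior depth
        samples.append(min(max(t, near), far))  # keep the sample inside the ray bounds
    return sorted(samples)                      # volume rendering expects ordered depths


if __name__ == "__main__":
    ts = depth_guided_samples(depth_prior=2.0, sigma=0.05, n_samples=8,
                              near=0.1, far=10.0, seed=0)
    print(ts)
```

Compared with uniform sampling over [near, far], a scheme like this spends most samples close to the expected surface, which is one plausible way a depth prior could sharpen geometry; a smaller `sigma` expresses more trust in the prior.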