AltNeRF: Learning Robust Neural Radiance Field via Alternating Depth-Pose Optimization

Kun Wang, Zhiqiang Yan, Huang Tian, Zhenyu Zhang, Xiang Li, Jun Li, Jian Yang

TL;DR

AltNeRF tackles unstable NeRF reconstructions from monocular video by introducing an alternating depth-pose optimization that couples self-supervised monocular depth estimation (SMDE) with Neural Radiance Fields (NeRF). It deploys two modules: a Scene Prior Module (SPM) that provides depth and pose priors via SMDE, and a Scene Representation Module (SRM) that refines 3D geometry and camera poses with depth regularization, improved pose initialization, and warmup learning; these modules are wired through an alternating workflow with multi-view consistency checks. The approach achieves state-of-the-art novel-view synthesis and depth estimation on LLFF, ScanNet, CO3D, and a new Captures dataset, particularly excelling in textureless and indoor scenes and reducing reliance on COLMAP estimates. This work demonstrates that alternating depth-pose priors and NeRF optimization can robustly produce realistic 3D reconstructions from monocular video, with broad implications for scalable 3D scene capture and rendering in uncontrolled environments.

Abstract

Neural Radiance Fields (NeRF) have shown promise in generating realistic novel views from sparse scene images. However, existing NeRF approaches often struggle because they lack explicit 3D supervision and rely on imprecise camera poses, resulting in suboptimal outcomes. To tackle these issues, we propose AltNeRF -- a novel framework designed to create resilient NeRF representations using self-supervised monocular depth estimation (SMDE) from monocular videos, without relying on known camera poses. SMDE in AltNeRF learns depth and pose priors to regularize NeRF training. The depth prior improves NeRF's ability to represent scene geometry precisely, while the pose prior provides a robust starting point for subsequent pose refinement. Moreover, we introduce an alternating algorithm that feeds NeRF outputs back into SMDE through a consistency-driven mechanism, thus improving the quality of the depth priors. This alternation allows AltNeRF to progressively refine NeRF representations and synthesize realistic novel views. Extensive experiments show that AltNeRF generates high-fidelity, robust novel views that closely resemble reality.
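The alternation described above can be sketched as a plain control loop. This is only an illustrative toy, not the authors' implementation: the function names (`alternating_optimization`, `toy_estimate_prior`, `toy_fit_scene`, `toy_update_prior`), the scalar "pose", and the rule that each fit moves halfway toward a known ground truth are all assumptions made so that the loop runs on its own. In the paper the prior comes from an SMDE network and the fit is a NeRF optimization.

```python
from typing import Callable, List, Tuple

Depth = List[float]
Pose = List[float]


def alternating_optimization(
    estimate_prior: Callable[[], Tuple[Depth, Pose]],
    fit_scene: Callable[[Depth, Pose], Tuple[Depth, Pose]],
    update_prior: Callable[[Depth, Pose], None],
    num_rounds: int,
) -> Tuple[Depth, Pose]:
    """Alternate between a prior module and a scene-representation module."""
    depth, pose = estimate_prior()            # initial depth and pose priors
    for _ in range(num_rounds):
        depth, pose = fit_scene(depth, pose)  # refine geometry and poses
        update_prior(depth, pose)             # feed refined outputs back
        depth, pose = estimate_prior()        # re-estimate improved priors
    return depth, pose


# --- Toy stand-ins (hypothetical; for illustration only) ---
TRUE_DEPTH = [2.0, 4.0]
TRUE_POSE = [0.5]
state = {"depth": [3.0, 3.0], "pose": [0.0]}  # deliberately poor initial prior


def toy_estimate_prior() -> Tuple[Depth, Pose]:
    return list(state["depth"]), list(state["pose"])


def toy_fit_scene(depth: Depth, pose: Pose) -> Tuple[Depth, Pose]:
    # Pretend image supervision pulls each estimate halfway toward the truth.
    new_depth = [d + 0.5 * (t - d) for d, t in zip(depth, TRUE_DEPTH)]
    new_pose = [p + 0.5 * (t - p) for p, t in zip(pose, TRUE_POSE)]
    return new_depth, new_pose


def toy_update_prior(depth: Depth, pose: Pose) -> None:
    # Pretend the prior module is fine-tuned to reproduce the refined outputs.
    state["depth"], state["pose"] = depth, pose


final_depth, final_pose = alternating_optimization(
    toy_estimate_prior, toy_fit_scene, toy_update_prior, num_rounds=2
)
```

In this toy, each round halves the remaining error, which mirrors only the qualitative claim that the priors improve across alternating steps; it says nothing about the actual convergence behaviour of AltNeRF.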

Paper Structure

This paper contains 30 sections, 12 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: (a) We show that NeRF is prone to fitting incorrect geometry by back-projecting a training view to a point cloud using the estimated radiance and density. (b) We also show how NeRF creation is affected by inaccurate poses by optimizing NeRF with COLMAP poses and with our pose estimates.
  • Figure 2: (a) Existing methods establish a fixed target (the blue dot) using an inaccurate depth prior, whereas we leverage valuable intermediate results from NeRF to dynamically adjust the objective (the green dots) towards the real depth (the red dot). (b) Pose refinement starting from different initial poses. The experiment is conducted on the Vasedeck scene.
  • Figure 3: The overall pipeline of our AltNeRF. The scene prior module estimates depth and pose, which serve as the depth reference and initial poses, respectively. The scene representation module simultaneously refines the initial poses with $\Delta P_i$ and learns a 3D scene representation, which is regularized by $D_i$, and produces more accurate poses $\hat{P}_{i+1}$ and finer depth maps $\hat{D}_{i+1}$. The refined depth maps and poses are then fed back to the scene prior module as guidance to improve its performance.
  • Figure 4: We illustrate that the depth estimation of SPM improves by visualizing the initial depth estimate $D_0$ and the depth estimates $D_1$ and $D_2$ after 1 and 2 alternating steps.
  • Figure 5: Qualitative comparisons of novel view synthesis and depth estimation on the LLFF and Captures datasets.
  • ...and 2 more figures
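
The back-projection mentioned in the Figure 1 caption can be illustrated with a minimal sketch. It assumes a depth map has already been rendered for the view and that the camera follows a standard pinhole model; the function name `back_project` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical choices for this example, not values or names taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def back_project(
    depth: List[List[float]], fx: float, fy: float, cx: float, cy: float
) -> List[Point]:
    """Lift each pixel (u, v) with depth d to a 3D point in camera coordinates.

    Pinhole model: x = (u - cx) * d / fx, y = (v - cy) * d / fy, z = d.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):       # v indexes image rows
        for u, d in enumerate(row):       # u indexes image columns
            points.append(((u - cx) * d / fx, (v - cy) * d / fy, d))
    return points


# Tiny 2x2 depth map with made-up intrinsics (assumed values).
depth_map = [[2.0, 2.0], [4.0, 4.0]]
cloud = back_project(depth_map, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

If the depth map is geometrically wrong, the resulting point cloud is visibly distorted even when the rendered image looks plausible, which is the kind of failure the figure is meant to expose.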