ICON: Incremental CONfidence for Joint Pose and Radiance Field Optimization

Weiyao Wang, Pierre Gleize, Hao Tang, Xingyu Chen, Kevin J Liang, Matt Feiszli

TL;DR

ICON tackles the problem of learning Neural Radiance Fields from monocular video without pose initialization by introducing an incremental, confidence-guided optimization. It builds a Neural Confidence Field to dynamically reweight NeRF and pose gradients and couples incremental frame registrations with restart strategies and a Sampson-distance geometric constraint. The approach achieves state-of-the-art or competitive results on CO3D and HO3D, often outperforming SfM-based pose pipelines and matching RGB-D methods in dynamic object scenarios. This work advances camera-pose-free NeRF training and object-centric 3D reconstruction from RGB video, with potential for broader video inputs and reduced reliance on depth sensors.
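The Sampson distance named above is the standard first-order approximation of the squared geometric epipolar error for a point correspondence. The sketch below shows only that textbook formula in plain Python. The function name `sampson_distance`, the helper names, and the toy fundamental matrix are illustrative assumptions, and the summary does not specify how ICON weights or applies the term.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def transpose(m):
    """Transpose a 3x3 matrix given as a list of rows."""
    return [[m[j][i] for j in range(3)] for i in range(3)]


def sampson_distance(F, x1, x2):
    """Sampson distance for the correspondence x1 <-> x2 under F.

    x1 and x2 are homogeneous image points (u, v, 1) and F maps a point
    in image 1 to its epipolar line in image 2, so x2^T F x1 = 0 for a
    perfect match. The denominator is zero only in degenerate cases,
    which this sketch does not guard against.
    """
    Fx1 = mat_vec(F, x1)               # epipolar line of x1 in image 2
    Ftx2 = mat_vec(transpose(F), x2)   # epipolar line of x2 in image 1
    algebraic = sum(x2[i] * Fx1[i] for i in range(3))
    denom = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return algebraic ** 2 / denom


# Hypothetical toy setup: identity rotation and a unit translation along x
# in normalized coordinates, so F equals the skew-symmetric matrix of (1, 0, 0).
F_toy = [[0.0, 0.0, 0.0],
         [0.0, 0.0, -1.0],
         [0.0, 1.0, 0.0]]

on_line = sampson_distance(F_toy, [0.3, 0.5, 1.0], [0.1, 0.5, 1.0])
off_line = sampson_distance(F_toy, [0.3, 0.5, 1.0], [0.1, 0.7, 1.0])
```

In this toy setup a correct match keeps the same vertical coordinate in both images, so `on_line` is zero and `off_line` grows with the square of the vertical offset. A pose optimizer could use such a term to penalize relative poses that disagree with 2D correspondences, though the exact formulation in ICON is given in the paper rather than here.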

Abstract

Neural Radiance Fields (NeRF) exhibit remarkable performance for Novel View Synthesis (NVS) given a set of 2D images. However, NeRF training requires an accurate camera pose for each input view, typically obtained by Structure-from-Motion (SfM) pipelines. Recent works have attempted to relax this constraint, but they still often rely on decent initial poses which they can refine. Here we aim at removing the requirement for pose initialization. We present Incremental CONfidence (ICON), an optimization procedure for training NeRFs from 2D video frames. ICON only assumes smooth camera motion to estimate an initial guess for poses. Further, ICON introduces "confidence": an adaptive measure of model quality used to dynamically reweight gradients. ICON relies on high-confidence poses to learn NeRF, and high-confidence 3D structure (as encoded by NeRF) to learn poses. We show that ICON, without prior pose initialization, achieves superior performance on both CO3D and HO3D versus methods which use SfM pose.

Paper Structure (18 sections, 6 equations, 6 figures, 10 tables)


Figures (6)

  • Figure 1: Novel view and pose visualizations of ICON and BARF when no initial pose is available. We train on a flyaround video of a book from CO3D (Reizenstein et al., 2021). BARF trajectories exhibit fragmentation: the camera poses split into two forward-facing clusters and create two books. ICON provides high-quality view synthesis and recovers poses very precisely. The colored triangle meshes represent ICON-predicted poses and the grey ones represent ground truth.
  • Figure 2: ICON overview. ICON constructs a Neural Confidence Field on top of NeRF to encode confidence $\zeta$ for each 3D location. The confidence is then used to guide the optimization process.
  • Figure 3: Three major failure modes of joint pose and NeRF optimization: fragmentation, Bas Relief, and overlapping registration. The colored poses are predictions; grey poses are ground truth. Fragmentation: Pose and NeRF break apart, producing separate, mutually invisible radiance fields. Here a tube of toy trucks is created, each occluding the next. Poses fly through this tube flipbook-style, each seeing a single toy truck. See also Fig. 1, where completely independent reconstructions occur in different regions of 3-space. Bas Relief: Due to an inherent ambiguity in RGB reconstruction, the model constructs a "relief" by creating a concave apple inside the table, which results in camera trajectories inverted by 180 degrees. Overlapping Registration: Two subsets of the pose trajectory are trapped in a local minimum, incorrectly observing the same part of the radiance field, leading to blurry rendering and empty voxels. Here, one side of the toaster is blurry due to overlapping views, while the other has no views and is vacant.
  • Figure 4: Novel view synthesis visualization of ICON without poses and NeRF trained with GT poses. Despite having no pose priors, ICON renders novel views at comparable or higher quality. Results are taken from LLFF and CO3D.
  • Figure 5: Visualization of ICON novel view synthesis on HO3D. ICON can recover shapes and textures accurately.
  • ...and 1 more figure