Invertible Neural Warp for NeRF

Shin-Fang Chng, Ravi Garg, Hemanth Saratchandran, Simon Lucey

TL;DR

This work tackles the challenging problem of jointly optimizing camera poses and NeRF by moving away from explicit SE(3) pose parameterizations to an overparameterized, invertible ray-warp representation. It introduces an explicit Invertible Neural Network (INN) to model rigid ray transformations, coupled with a geometry-informed rigidity prior to preserve bijectivity and guide optimization. Across 2D planar tests, LLFF forward-facing scenes, and 360° DTU data, the INN-based approach yields substantial pose-accuracy gains (often exceeding 50% relative to SE(3)-based baselines) and improved high-fidelity reconstructions, outperforming BARF and L2G baselines. The results demonstrate that enforcing invertibility and leveraging homeomorphisms in warp representations can significantly enhance convergence and robustness in joint NeRF pose estimation and view synthesis, with implications for more reliable 3D reconstruction in challenging settings.

Abstract

This paper tackles the simultaneous optimization of pose and Neural Radiance Fields (NeRF). Departing from the conventional practice of using explicit global representations for camera pose, we propose a novel overparameterized representation that models camera poses as learnable rigid warp functions. We establish that modeling rigid warps must be tightly coupled with the constraints and regularization imposed on them. Specifically, we highlight the critical importance of enforcing invertibility when learning rigid warp functions via a neural network, and propose the use of an Invertible Neural Network (INN) coupled with a geometry-informed constraint for this purpose. We present results on synthetic and real-world datasets, and demonstrate that our approach outperforms existing baselines in terms of pose estimation and high-fidelity reconstruction due to enhanced optimization convergence.
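
To make the idea of an invertible warp conditioned on a per-frame latent code concrete, here is a minimal sketch of a single affine coupling layer in plain Python. It is an illustration under stated assumptions, not necessarily the architecture used in the paper: the function names (`make_params`, `conditioner`, `coupling_forward`, `coupling_inverse`), the split of a 3D point into one kept and two transformed coordinates, and the tiny tanh conditioner are all hypothetical choices. The point it shows is that the inverse is available in closed form regardless of the conditioner's weights.

```python
import math
import random


def make_params(latent_dim, hidden_dim, n_keep=1, n_dims=3, seed=0):
    """Random weights for the hypothetical conditioner network."""
    rng = random.Random(seed)
    n_in = n_keep + latent_dim          # kept coordinates + latent code
    n_out = 2 * (n_dims - n_keep)       # one log-scale and one shift per moved coordinate
    return {
        "w_hidden": [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden_dim)],
        "w_out": [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(n_out)],
    }


def conditioner(kept, latent, params):
    """Map the kept coordinates and latent code to (log_scale, shift) lists."""
    inputs = list(kept) + list(latent)
    hidden = [math.tanh(sum(w * v for w, v in zip(row, inputs))) for row in params["w_hidden"]]
    out = [sum(w * h for w, h in zip(row, hidden)) for row in params["w_out"]]
    half = len(out) // 2
    log_scale = [math.tanh(o) for o in out[:half]]  # bounded so exp() stays well behaved
    shift = out[half:]
    return log_scale, shift


def coupling_forward(x, latent, params, n_keep=1):
    """Warp a point: the kept part passes through, the rest is scaled and shifted."""
    kept, moved = x[:n_keep], x[n_keep:]
    log_scale, shift = conditioner(kept, latent, params)
    warped = [m * math.exp(s) + t for m, s, t in zip(moved, log_scale, shift)]
    return list(kept) + warped


def coupling_inverse(y, latent, params, n_keep=1):
    """Exact inverse: the kept part is unchanged, so the same scale and shift can be recomputed."""
    kept, warped = y[:n_keep], y[n_keep:]
    log_scale, shift = conditioner(kept, latent, params)
    moved = [(w - t) * math.exp(-s) for w, s, t in zip(warped, log_scale, shift)]
    return list(kept) + moved


if __name__ == "__main__":
    params = make_params(latent_dim=2, hidden_dim=4)
    point_cam = [0.3, -1.2, 0.7]   # illustrative point in camera coordinates
    frame_code = [0.5, -0.1]       # illustrative per-frame latent code
    point_world = coupling_forward(point_cam, frame_code, params)
    recovered = coupling_inverse(point_world, frame_code, params)
    print(point_world, recovered)
```

A single layer leaves the kept coordinate untouched, so practical invertible networks typically stack several such layers while alternating which coordinates are kept. Invertibility alone also does not make the warp rigid, which is why the abstract pairs the INN with a geometry-informed constraint.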

Paper Structure (38 sections, 6 equations, 6 figures, 4 tables)

This paper contains 38 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: We investigate how overparameterizing rigid warps of rays with an MLP benefits the joint optimization of camera pose and NeRF. This example estimates the warps that align the color-coded patches in \ref{['fig:planar_exp_gt']} while solving for the neural field. Whereas BARF and the "naive" MLP pose overparameterization fail catastrophically, enforcing invertibility, either implicitly (\ref{['fig:planar_exp_cmlp']}) or explicitly (\ref{['fig:planar_exp_inn']}), significantly improves warp estimation; see \ref{['subsec:exp_baselines']} for details of each method. We establish that invertibility is crucial for MLP-based rigid warp representations.
  • Figure 2: An overview of our INN-based approach, illustrated using two views $\mathcal{I}_1$ and $\mathcal{I}_2$. The INN, denoted $h_{\mathbf{\Theta}_{\mathcal{W}}}$, takes the pixel locations in the camera coordinate system $\mathbf{x}_{i,t}^{(\textcolor{red}{C})}$, along with the frame-dependent latent code $\Phi_{t}$, and outputs the corresponding locations in the world coordinate system, $\mathbf{x}_{i,t}^{(\textcolor{blue}{W})}$; see \ref{['subsec:proposed']} for full details.
  • Figure 3: Convergence basin analysis of our approach versus BARF in a 2D planar experiment. On the left, we show the success rate over 20 runs, where we initialize the homography with the ground truth and gradually introduce noise perturbations to the translation component. The noise scale is varied from $0$ to $0.30$. We use a 5-pixel threshold to determine successful convergence. Notably, our approach (blue) demonstrates higher noise tolerance than BARF (red). On the right, we show a qualitative comparison when both methods are perturbed with the highest magnitude of noise.
  • Figure 4: Qualitative analysis of reconstruction error on leaves (top) and trex (bottom). We present the average image reconstruction error through insets. Our approach yields the lowest misalignment error, as indicated by the darker areas in the error map.
  • Figure 5: Qualitative analysis of intermediate rendered images compared to the superimposed ground truth (lighter visualization) when using L2G [chen2023local] and our approach for single-view pose estimation. Notably, when using our approach for pose estimation, we observe noticeable deformation in the rendered scene depicted in the bottom row (zoom in for a better view). These deformations indicate that, at each iteration, the INN predicts general homeomorphisms that are not rigid transformations, thus yielding a flexible optimization trajectory that does not get stuck in a suboptimal minimum.
  • ...and 1 more figures