Flow-NeRF: Joint Learning of Geometry, Poses, and Dense Flow within Unified Neural Representations

Xunzhi Zheng, Dan Xu

TL;DR

Flow-NeRF tackles pose-free NeRF by jointly learning camera poses, scene geometry, and dense optical flow within a unified neural representation. It introduces a two-branch architecture with shared point sampling, a pose-conditioned bijective mapping for dense novel-view flow via Real-NVP, and a feature message-passing path that distills flow information into geometry. The learning objective combines photometric, depth, point-cloud, and flow losses to produce accurate novel-view synthesis, depth estimation, pose prediction, and long-range novel-view flow. Experiments on Tanks & Temples, ScanNet, and Sintel demonstrate substantial improvements in NVS and depth and competitive long-range flow, enabling holistic scene modeling and meaningful correspondences across novel views.

Abstract

Learning accurate scene reconstruction without pose priors in neural radiance fields is challenging due to inherent geometric ambiguity. Recent developments either rely on correspondence priors for regularization or use off-the-shelf flow estimators to derive analytical poses. However, the potential for jointly learning scene geometry, camera poses, and dense flow within a unified neural representation remains largely unexplored. In this paper, we present Flow-NeRF, a unified framework that simultaneously optimizes scene geometry, camera poses, and dense optical flow on the fly. To enable the learning of dense flow within the neural radiance field, we design and build a bijective mapping for flow estimation, conditioned on pose. To make the scene reconstruction benefit from the flow estimation, we develop an effective feature enhancement mechanism to pass canonical space features to world space representations, significantly enhancing scene geometry. We validate our model across four important tasks, i.e., novel view synthesis, depth estimation, camera pose prediction, and dense optical flow estimation, using several datasets. Our approach surpasses previous methods in almost all metrics for novel view synthesis and depth estimation and yields both qualitatively sound and quantitatively accurate novel-view flow. Our project page is https://zhengxunzhi.github.io/flownerf/.
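
To make the idea of a pose-conditioned bijective mapping more concrete, the following is a minimal sketch of Real-NVP-style affine coupling layers acting on 3D points, with the scale and shift predicted from the untouched coordinates plus a pose vector. It is an illustration under stated assumptions, not the authors' implementation: the class names, the 6-D pose vector, the choice of masks, and the fixed random weights (standing in for a learned network) are all hypothetical.

```python
import math
import random


def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Tiny two-layer tanh MLP with fixed random weights (stand-in for a learned network)."""
    w1 = [[rng.gauss(0.0, 0.5) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0.0, 0.5) for _ in range(hidden_dim)] for _ in range(out_dim)]

    def forward(x):
        h = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in w1]
        return [sum(w * v for w, v in zip(row, h)) for row in w2]

    return forward


class PoseConditionedCoupling:
    """Affine coupling layer: coordinates in `keep` pass through unchanged and,
    together with the pose, drive a scale and shift of the remaining coordinates."""

    def __init__(self, keep, pose_dim, rng):
        self.keep = keep
        self.change = [i for i in range(3) if i not in keep]
        self.net = make_mlp(len(keep) + pose_dim, 16, 2 * len(self.change), rng)

    def _scale_shift(self, point, pose):
        out = self.net([point[i] for i in self.keep] + list(pose))
        n = len(self.change)
        log_s = [math.tanh(v) for v in out[:n]]  # bounded log-scale for stability
        return log_s, out[n:]

    def forward(self, x, pose):
        log_s, t = self._scale_shift(x, pose)
        y = list(x)
        for k, i in enumerate(self.change):
            y[i] = x[i] * math.exp(log_s[k]) + t[k]
        return y

    def inverse(self, y, pose):
        # The kept coordinates are identical in x and y, so the same
        # scale and shift can be recomputed from y alone.
        log_s, t = self._scale_shift(y, pose)
        x = list(y)
        for k, i in enumerate(self.change):
            x[i] = (y[i] - t[k]) * math.exp(-log_s[k])
        return x


class PoseConditionedFlow:
    """Stack of coupling layers mapping points to and from a canonical space."""

    def __init__(self, pose_dim=6, seed=0):
        rng = random.Random(seed)
        masks = [[0], [1], [2], [0, 1]]  # hypothetical choice of kept coordinates
        self.layers = [PoseConditionedCoupling(m, pose_dim, rng) for m in masks]

    def to_canonical(self, x, pose):
        for layer in self.layers:
            x = layer.forward(x, pose)
        return x

    def from_canonical(self, u, pose):
        for layer in reversed(self.layers):
            u = layer.inverse(u, pose)
        return u

    def map_between(self, x, pose_src, pose_dst):
        """Move a point from the source view to the destination view via canonical space."""
        return self.from_canonical(self.to_canonical(x, pose_src), pose_dst)


flow = PoseConditionedFlow()
pose_a = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]      # hypothetical 6-D pose encodings
pose_b = [0.1, 0.0, -0.2, 0.05, 0.0, 0.3]
x = [0.3, -0.7, 1.2]
x_b = flow.map_between(x, pose_a, pose_b)     # corresponding point under pose_b
x_back = flow.map_between(x_b, pose_b, pose_a)
round_trip_error = max(abs(a - b) for a, b in zip(x, x_back))
```

Because every layer is invertible in closed form, mapping a point from one pose to another and back recovers the original point up to floating-point error. A bijection of this kind is what allows correspondences to be queried in both the forward and backward directions.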

Paper Structure

This paper contains 21 sections, 13 equations, 17 figures, 15 tables.

Figures (17)

  • Figure 1: Our Flow-NeRF model can simultaneously infer novel-view images, novel-view depth, and long-range novel-view flow without requiring a pose prior. Although we train the model solely on consecutive forward pseudo flow, it can infer plausible forward and backward long-range novel-view flows (as illustrated in the two bottom images). In this figure, t+8 and t-8 denote novel-view forward and backward flow, respectively, with a frame interval of 8.
  • Figure 2: Overview of the proposed Flow-NeRF: Our method takes a sequence of images as input and jointly learns camera poses, scene geometry, and dense optical flow with a unified neural representation framework. We propose a shared points sampling mechanism to ensure feature consistency between the geometry and flow branches (see the section on shared points sampling). We build a bijective mapping to query per-pixel motion given sampled points as input, conditioned on pose (see the section on querying flow). Leveraging the complementary nature of features between the world space and the 3D canonical volume, we enhance the feature representation of the geometry branch by message passing (see the section on feature enhancement). We also develop effective loss functions to simultaneously learn flow and scene reconstruction, while imposing constraints on relative poses (see the section on loss functions).
  • Figure 3: Illustration of the details of the proposed shared points sampling mechanism (a) for both the geometry and flow branches and the feature message passing module (b) to couple them for learning a unified scene neural representation.
  • Figure 4: Latent feature embedding with time or pose condition. While time-conditioned input can only support inference on training views, our pose-conditioned input can infer novel-view flow thanks to the geometric certainty that pose provides.
  • Figure 5: Qualitative comparison with BARF [lin2021barf] and Nope-NeRF [bian2023nope] for novel view synthesis on the Tanks and Temples dataset. Our method achieves superior novel-view rendering quality with enhanced details.
  • ...and 12 more figures