Factorized Motion Fields for Fast Sparse Input Dynamic View Synthesis

Nagabhushan Somraj, Kapil Choudhary, Sai Harsha Mupparaju, Rajiv Soundararajan

TL;DR

This work addresses fast dynamic view synthesis from sparse multi-view footage by introducing RF-DeRF, an explicit, factorized dynamic radiance-field framework. It combines a 5D canonical radiance field with a 4D scene flow, both implemented as fast, hex-plane factorized volumes, and regularizes motion with a complementary pair of flow priors: sparse cross-camera flow via SIFT-based keypoints and dense within-camera flow via RAFT. The approach yields state-of-the-art results on N3DV and InterDigital datasets under sparse viewpoints, while maintaining practical training and rendering speeds and modest memory. The method reduces reliance on dense input views and provides a practical path toward real-time or near-real-time dynamic view synthesis in sparse-view scenarios.

Abstract

Designing a 3D representation of a dynamic scene for fast optimization and rendering is a challenging task. While recent explicit representations enable fast learning and rendering of dynamic radiance fields, they require a dense set of input viewpoints. In this work, we focus on learning a fast representation for dynamic radiance fields with sparse input viewpoints. However, optimization with sparse input is under-constrained and requires motion priors to constrain the learning. Existing fast dynamic scene models do not explicitly model the motion, which makes them difficult to constrain with motion priors. We design an explicit motion model as a factorized 4D representation that is fast and can exploit the spatio-temporal correlation of the motion field. We then introduce reliable flow priors, combining sparse flow priors across cameras with dense flow priors within cameras, to regularize our motion model. Our model is fast, compact and achieves very good performance on popular multi-view dynamic scene datasets with sparse input viewpoints. The source code for our model can be found on our project page: https://nagabhushansn95.github.io/publications/2024/RF-DeRF.html.
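
To make the idea of a factorized 4D motion representation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes six 2D feature planes, one per pair of the axes (x, y, z, t), fused by an element-wise product and decoded by a tiny two-layer MLP into a 3D offset toward the canonical volume. The names `bilinear`, `query_features`, `motion_field`, the product fusion, and the weights `w1`, `w2` are illustrative assumptions; coordinates are assumed normalized to [0, 1].

```python
import itertools
import numpy as np

# The six axis pairs of (x, y, z, t): xy, xz, xt, yz, yt, zt.
PLANE_PAIRS = list(itertools.combinations(range(4), 2))

def bilinear(plane, u, v):
    """Bilinearly sample an (R, R, C) feature plane at coords u, v in [0, 1]."""
    res = plane.shape[0]
    x = np.clip(u, 0.0, 1.0) * (res - 1)
    y = np.clip(v, 0.0, 1.0) * (res - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, res - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, res - 2)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    return ((1 - wx) * (1 - wy) * plane[x0, y0]
            + wx * (1 - wy) * plane[x0 + 1, y0]
            + (1 - wx) * wy * plane[x0, y0 + 1]
            + wx * wy * plane[x0 + 1, y0 + 1])

def query_features(planes, pts4d):
    """Fuse the six plane features by element-wise product (an assumption)."""
    feat = np.ones((pts4d.shape[0], planes[0].shape[-1]))
    for plane, (a, b) in zip(planes, PLANE_PAIRS):
        feat = feat * bilinear(plane, pts4d[:, a], pts4d[:, b])
    return feat

def motion_field(planes, w1, w2, pts4d):
    """Map points (x, y, z, t) to 3D positions in the canonical volume."""
    hidden = np.maximum(query_features(planes, pts4d) @ w1, 0.0)  # ReLU
    return pts4d[:, :3] + hidden @ w2  # input position plus predicted offset

# Hypothetical sizes: 32x32 planes with 8 channels, 16 hidden units.
rng = np.random.default_rng(0)
planes = [rng.normal(size=(32, 32, 8)) for _ in PLANE_PAIRS]
w1, w2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 3))
canonical = motion_field(planes, w1, w2, rng.uniform(size=(5, 4)))
```

With resolution R and C channels, the six planes store 6·R²·C values, whereas a dense 4D grid would store R⁴·C, which is one reason a factorization of this kind can stay compact.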

Paper Structure (18 sections, 7 equations, 13 figures, 12 tables)

This paper contains 18 sections, 7 equations, 13 figures, 12 tables.

Figures (13)

  • Figure 1: Model architecture: We decompose the dynamic radiance field into a 4D scene flow or deformation field $\mathcal{F}_{f}$ that maps a 3D point $\mathbf{p}_i$ at time $t$ to the corresponding 3D point $\mathbf{p}_i'$ at canonical time $t'$, and a 5D radiance field $\mathcal{F}_{s}$ that models the scene at canonical time $t'$. Both the fields are modeled using a factorized volume followed by a tiny MLP, which allows fast optimization and rendering. We note that $\mathbf{G}_f$ is modeled using six planes, although we show only three owing to the difficulty in visualizing four dimensions. The MLP $\mathbf{M}_s$ is conditioned on time and viewing direction to model time-dependent color variations such as shadows and view-dependent color variations such as specularities. The output of $\mathcal{F}_{s}$ is volume rendered to obtain the color of the pixel and the photometric loss is used to train both the fields. The explicitly modeled motion field $\mathcal{F}_{f}$ is additionally regularized using the flow priors as shown in Figure 2.
  • Figure 2: Flow regularization: Since the motion field $\mathcal{F}_{f}$ gives only the unidirectional flow from time $t$ to $t'$, we impose the flow prior by minimizing the distance between the 3D points in the canonical volume corresponding to the matched pixels $(\mathbf{q}_t^v, \mathbf{q}_s^u)$ in the input frames $(I_t^v,I_s^u)$.
  • Figure 3: Visualization of different flow priors: We show the matched pixels as provided by different flow priors. The pixels in the first view are randomly picked from those for which sparse flow is available and the same pixels are used for dense flow. Note that the second view in the first two examples has more blur than the first view. (a) Dense flow priors across cameras obtained using deep optical flow networks such as RAFT (Teed and Deng, 2020) are prone to erroneous matches, due to variations in camera parameters and lighting. We observed similar trends with other deep optical flow networks, such as AR-Flow (Liu et al., 2020). (b) Matching pixels across cameras using robust SIFT features provides reliable matches, albeit sparse. (c) Within individual cameras, the dense correspondences provided by deep optical flow networks are more reliable owing to smaller variations in lighting.
  • Figure 4: Qualitative examples on N3DV dataset with two input views: We can observe that K-Planes finds it hard to learn the moving person leading to significant distortions. Our DeRF model (without any priors) corrects a few errors by virtue of the common canonical volume. Imposing our priors leads to much better reconstruction.
  • Figure 5: Qualitative examples for depth priors vs flow priors: We observe that sparse depth prior is not very effective in improving the reconstruction quality. However, our sparse flow prior is highly effective in mitigating the distortions, while the use of sparse and dense flow priors in our final model leads to the best reconstruction quality.
  • ...and 8 more figures
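
The flow regularization described in the Figure 2 caption can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' code: each matched pixel is represented by a ray with sampled depths and volume-rendering weights, the 3D point for a pixel is taken to be the weight-averaged canonical position of its samples, and the loss is the mean Euclidean distance between the two canonical points. The function names, the `warp` callable (standing in for the motion field $\mathcal{F}_{f}$), and the choice of distance are all assumptions introduced here.

```python
import numpy as np

def expected_canonical_point(origins, dirs, depths, weights, t, warp):
    """Weight-averaged canonical 3D point for each ray.

    origins, dirs: (N, 3); depths, weights: (N, S); t: (N,).
    warp maps (..., 4) points (x, y, z, t) to (..., 3) canonical points.
    """
    pts = origins[:, None, :] + depths[..., None] * dirs[:, None, :]  # (N, S, 3)
    times = np.broadcast_to(t[:, None, None], pts.shape[:2] + (1,))   # (N, S, 1)
    canon = warp(np.concatenate([pts, times], axis=-1))               # (N, S, 3)
    return (weights[..., None] * canon).sum(axis=1)                   # (N, 3)

def flow_prior_loss(ray_a, ray_b, warp):
    """Mean distance between canonical points of matched pixels.

    ray_a, ray_b: tuples (origins, dirs, depths, weights, t) for the matched
    pixels in frames I_t^v and I_s^u respectively.
    """
    pa = expected_canonical_point(*ray_a, warp)
    pb = expected_canonical_point(*ray_b, warp)
    return np.mean(np.linalg.norm(pa - pb, axis=-1))

# Toy example: two rays from different cameras that meet at (0, 0, 2).
depths = np.array([[1.0, 2.0]])
weights = np.array([[0.0, 1.0]])  # all rendering weight on the second sample
ray_a = (np.array([[0.0, 0.0, 0.0]]), np.array([[0.0, 0.0, 1.0]]),
         depths, weights, np.array([0.0]))
ray_b = (np.array([[2.0, 0.0, 2.0]]), np.array([[-1.0, 0.0, 0.0]]),
         depths, weights, np.array([1.0]))

static_warp = lambda p: p[..., :3]  # no motion: canonical point equals input
loss_static = flow_prior_loss(ray_a, ray_b, static_warp)
```

Because both points are compared in the canonical volume, a unidirectional field from time $t$ to $t'$ is sufficient, and the same loss form applies whether the match comes from a sparse cross-camera prior or a dense within-camera prior.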