Decoupling Dynamic Monocular Videos for Dynamic View Synthesis

Meng You, Junhui Hou

TL;DR

This work tackles dynamic view synthesis from monocular videos by decoupling object motion from camera motion without supervision. It introduces two regularizations, surface consistency for temporal geometric stability and a patch-based multi-view constraint for cross-view appearance, to regularize an NSFF-based dynamic NeRF without pre-processed optical flow or depth. On the NVIDIA Dynamic Scene and Neural 3D Video Synthesis datasets, the approach achieves state-of-the-art results on dynamic regions, along with improved scene flow and depth estimates, which supports the viability of unsupervised motion decomposition. Limitations include the handling of non-rigid deformations and the reliance on separating static and dynamic components; future work could accelerate rendering and add selective supervision.

Abstract

The challenge of dynamic view synthesis from dynamic monocular videos, i.e., synthesizing novel views for free viewpoints given a monocular video of a dynamic scene captured by a moving camera, mainly lies in accurately modeling the dynamic objects of a scene using limited 2D frames, each with a varying timestamp and viewpoint. Existing methods usually require pre-processed 2D optical flow and depth maps by off-the-shelf methods to supervise the network, making them suffer from the inaccuracy of the pre-processed supervision and the ambiguity when lifting the 2D information to 3D. In this paper, we tackle this challenge in an unsupervised fashion. Specifically, we decouple the motion of the dynamic objects into object motion and camera motion, respectively regularized by proposed unsupervised surface consistency and patch-based multi-view constraints. The former enforces the 3D geometric surfaces of moving objects to be consistent over time, while the latter regularizes their appearances to be consistent across different viewpoints. Such a fine-grained motion formulation can alleviate the learning difficulty for the network, thus enabling it to produce not only novel views with higher quality but also more accurate scene flows and depth than existing methods requiring extra supervision.

Paper Structure (14 sections, 16 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: Given a monocular video of a dynamic scene captured by a moving camera, we show novel views (zoomed in on the dynamic part) rendered at arbitrary viewpoints and any input timestep by the state-of-the-art method of Gao et al. [gao2021dynamic] and by ours. Please use Adobe Acrobat to display these videos.
  • Figure 2: Visual illustration of a limitation of the existing method of Gao et al. [gao2021dynamic]. The pre-processed 2D optical flow maps used to train the network contain severe errors, which make the corresponding parts of the synthesized novel views wrong.
  • Figure 3: Our approach follows existing methods by utilizing two neural radiance fields: a static NeRF [mildenhall2020nerf] for modeling the static background of a scene and an NSFF [li2021neural, gao2021dynamic] for modeling the dynamic objects and the scene flow. We train a Static NeRF model to reconstruct the scene background, excluding pixels marked as dynamic during training. We train a Dynamic NeRF that takes $(\mathbf{x}, \mathbf{d})$ and $t$ as input to model dynamic objects. The Dynamic NeRF predicts color $\mathbf{c}$, density $\sigma$, 3D scene flow $\mathbf{f}_f, \mathbf{f}_b$, and blending weight $p$. The blending weight $p$ enables the composition of the static and dynamic NeRF components.
  • Figure 4: Two successive frames of a dynamic monocular video, with camera poses and timestamps $(\mathbf{p}_a,~t_a)$ and $(\mathbf{p}_b,~t_b)$. Directly modeling the motion of the dynamic objects (highlighted in red frames) between them can be challenging. Here, we propose to decouple this motion into two types: (1) the motion caused by the object movement; and (2) the motion caused by the camera movement.
  • Figure 5: Illustration of the surface consistency constraint. Under the NeRF scheme, at timestamp $t$, we sample points (green points) on a ray and predict their density values to compute the weights with which they contribute to the rendering. The intersection point $\hat{\mathbf{x}}_t$ between the surface and the ray can be calculated as a weighted average of the coordinates of the sampled points. With NSFF, we can predict a 3D scene flow for each of the sampled points and warp them to the next timestamp $t+1$. Likewise, we can calculate the intersection point $\hat{\mathbf{x}}_{t+1}$ between the warped ray and the surface as a weighted average of the coordinates of the warped points. Our surface consistency constraint requires that the difference between $\hat{\mathbf{x}}_{t}$ and $\hat{\mathbf{x}}_{t+1}$ equal the scene flow predicted by NSFF for the intersection point at $t$.
  • ...and 5 more figures
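
The surface consistency constraint described in the Figure 5 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the reuse of the time-$t$ rendering weights for the warped samples, the normalization by the total weight, and the L1 form of the residual are all choices made for this example. `flow_fn` is a hypothetical stand-in for the forward scene flow predicted by the dynamic NeRF.

```python
import math


def render_weights(sigmas, deltas):
    """Standard NeRF volume-rendering weights: w_i = T_i * (1 - exp(-sigma_i * delta_i))."""
    weights = []
    transmittance = 1.0  # fraction of light that reaches the current sample
    for sigma, delta in zip(sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


def expected_surface_point(points, weights):
    """Weighted average of 3D sample coordinates (normalized by the total weight; an assumption)."""
    total = sum(weights)
    return tuple(
        sum(w * p[k] for w, p in zip(weights, points)) / total for k in range(3)
    )


def surface_consistency_residual(points, sigmas, deltas, flow_fn):
    """L1 gap between (x_hat_{t+1} - x_hat_t) and the flow predicted at x_hat_t.

    flow_fn is a hypothetical callable mapping a 3D point to its scene flow.
    The warped samples reuse the time-t weights, which is an assumption.
    """
    weights = render_weights(sigmas, deltas)
    x_t = expected_surface_point(points, weights)

    # Warp every sample to the next timestamp with its own predicted flow.
    warped = []
    for p in points:
        f = flow_fn(p)
        warped.append(tuple(p[k] + f[k] for k in range(3)))
    x_t1 = expected_surface_point(warped, weights)

    flow_at_surface = flow_fn(x_t)
    return sum(abs((x_t1[k] - x_t[k]) - flow_at_surface[k]) for k in range(3))


# Toy ray along the z-axis with a density jump near z = 3.
ray_points = [(0.0, 0.0, 1.0), (0.0, 0.0, 2.0), (0.0, 0.0, 3.0), (0.0, 0.0, 4.0)]
ray_sigmas = [0.0, 0.5, 5.0, 5.0]
ray_deltas = [1.0, 1.0, 1.0, 1.0]

constant_flow = lambda p: (0.1, 0.0, 0.0)        # same offset everywhere
varying_flow = lambda p: (p[2] ** 2, 0.0, 0.0)   # offset changes along the ray

residual_constant = surface_consistency_residual(
    ray_points, ray_sigmas, ray_deltas, constant_flow
)
residual_varying = surface_consistency_residual(
    ray_points, ray_sigmas, ray_deltas, varying_flow
)
```

With a spatially constant flow, every sample moves by the same offset, so the residual is zero up to floating-point error. A flow that varies along the ray gives a positive residual, which is the kind of inconsistency such a constraint would penalize.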