Pseudo-Generalized Dynamic View Synthesis from a Video

Xiaoming Zhao, Alex Colburn, Fangchang Ma, Miguel Angel Bautista, Joshua M. Susskind, Alexander G. Schwing

TL;DR

The paper probes whether generalized dynamic novel-view synthesis from monocular videos is achievable and proposes a pseudo-generalized framework that avoids scene-specific appearance fitting but relies on consistent depth estimates. Static content is rendered via an adapted generalizable NeRF Transformer with masked attention to handle dynamic occlusions, while dynamic content is reconstructed from depth- and time-based priors, including track-based temporal information. Experiments on NVIDIA Dynamic Scenes and DyCheck demonstrate competitive performance against some scene-specific baselines, highlighting the value of depth priors for generalization and outlining limitations due to depth/tracking quality. The work emphasizes the need for advances in monocular depth estimation and temporal aggregation to move closer to fully generalized dynamic NVS from monocular videos.

Abstract

Rendering scenes observed in a monocular video from novel viewpoints is a challenging problem. For static scenes, the community has studied both scene-specific optimization techniques, which optimize on every test scene, and generalized techniques, which only run a deep net forward pass on a test scene. In contrast, for dynamic scenes, scene-specific optimization techniques exist, but, to the best of our knowledge, there is currently no generalized method for dynamic novel view synthesis from a given monocular video. To answer whether generalized dynamic novel view synthesis from monocular videos is possible today, we establish an analysis framework based on existing techniques and work toward the generalized approach. We find that a pseudo-generalized process without scene-specific appearance optimization is possible, but geometrically and temporally consistent depth estimates are needed. Despite using no scene-specific appearance optimization, the pseudo-generalized approach improves upon some scene-specific methods.

Paper Structure (29 sections, 8 equations, 11 figures, 8 tables)

Figures (11)

  • Figure 1: (a) We find that novel view synthesis from monocular videos is possible without scene-specific appearance optimization, which can take up to hundreds of GPU hours per video, while still outperforming several scene-specific approaches on NVIDIA Dynamic Scenes [Yoon2020NovelVS]. (b) Qualitative results demonstrate the feasibility of generalized approaches: our rendering quality is on par or better (see details of the dragon balloon), even though we do not use any scene-specific appearance optimization. The face is masked out to protect privacy.
  • Figure 2: Framework overview (Sec. \ref{sec: approach overview}). Our analysis framework separately renders static (Sec. \ref{sec: st render}) and dynamic content (Sec. \ref{sec: dy render}). We focus on exploiting depth and temporal data priors, which are commonly utilized in scene-specific optimization methods.
  • Figure 3: Static content rendering (Sec. \ref{sec: st render}). Samples on a target ray (shown in (a)) project to epipolar lines in source views (see (b)). Vanilla GNT does not consider dynamic content, i.e., it makes no use of dynamic masks (green arrows) in (b), and produces artifacts as shown in (c).(i), e.g., the rendering for static walls is contaminated by the foreground balloon. We find those artifacts correlate with the standard deviation of sampled features across source views, as visualized in (c).(ii). Our proposal of using masked attention (Sec. \ref{sec: gnt adapt}) based on dynamic masks in GNT's view transformer improves the static content rendering, as shown in (c).(iii). The face is masked to protect privacy.
  • Figure 4: Dynamic content rendering (Sec. \ref{sec: dy render}). (a) For the two temporally closest source views (indices $i_\text{tgt}^-$ and $i_\text{tgt}^+$), with the help of depth, dynamic masks, and camera information, 2D dynamic content, e.g., the human and balloon in this figure, is lifted into two point clouds. They are then connected based on optical flow between $I_{i_\text{tgt}^-}$ and $I_{i_\text{tgt}^+}$ (Sec. \ref{sec: dy depth}). Correspondences are highlighted as "rainbow" lines in the point cloud set $\mathcal{P}$. We then obtain the point cloud $\mathcal{P}_\text{tgt}$ for target time $t_\text{tgt}$ based on the linear motion assumption. (b) We utilize temporal priors of 2D tracking and lift visible trajectories to 3D with depth priors (Sec. \ref{sec: dy temporal}). A complementary target point cloud is likewise obtained based on the linear motion assumption and aggregated with $\mathcal{P}_\text{tgt}$. For example, for the shown 3D track, we linearly interpolate 3D positions from $i_\text{tgt}^+ - 1$ and $i_\text{tgt}^+ + 2$, as they are temporally closest to $i_\text{tgt}$ among the visible track points (the track on $i_\text{tgt}^+ + 1$ is invisible). The face is masked to protect privacy.
  • Figure 5: Qualitative results on NVIDIA Dynamic Scenes. Even without scene-specific appearance optimization, the method still produces renderings of on-par or higher quality than some scene-specific approaches: our background aligns well with the ground truth, our highlighted foreground is sharper than DVS and NSFF, and we do not miss the right arm as DynIBaR does in the bottom row. Faces are masked to protect privacy.
  • ...and 6 more figures
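
The dynamic-content step described in the Figure 4 caption (lifting 2D pixels to 3D with depth and camera information, then placing them at the target time under a linear motion assumption) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `unproject` and `interpolate_linear`, the pinhole camera model, the z-depth convention, and the assumption that rows of the two point arrays are already matched (e.g., via optical flow or a 2D track) are all hypothetical choices made for illustration.

```python
import numpy as np


def unproject(pixels, depth, K, cam_to_world):
    """Lift N pixels to world-space 3D points (hypothetical helper).

    pixels:       (N, 2) array of (u, v) pixel coordinates
    depth:        (N,) z-depth along the camera axis (assumed convention)
    K:            (3, 3) pinhole intrinsics (assumed camera model)
    cam_to_world: (4, 4) camera-to-world pose
    """
    ones = np.ones((pixels.shape[0], 1))
    homog = np.concatenate([pixels, ones], axis=1)       # (N, 3) homogeneous pixels
    rays = homog @ np.linalg.inv(K).T                    # camera-space rays with z = 1
    pts_cam = rays * depth[:, None]                      # scale each ray by its depth
    pts_cam_h = np.concatenate([pts_cam, ones], axis=1)  # (N, 4) homogeneous points
    return (pts_cam_h @ cam_to_world.T)[:, :3]


def interpolate_linear(pts_prev, pts_next, t_prev, t_next, t_tgt):
    """Place corresponding 3D points at t_tgt assuming linear motion.

    Row k of pts_prev and row k of pts_next are assumed to be the same
    physical point observed at times t_prev and t_next.
    """
    alpha = (t_tgt - t_prev) / (t_next - t_prev)
    return (1.0 - alpha) * pts_prev + alpha * pts_next


# Toy example: one point seen at the image center at time 0 and
# 100 pixels to the right at time 2, both at depth 2.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)  # identity camera pose for simplicity
p_prev = unproject(np.array([[50.0, 50.0]]), np.array([2.0]), K, pose)
p_next = unproject(np.array([[150.0, 50.0]]), np.array([2.0]), K, pose)
p_tgt = interpolate_linear(p_prev, p_next, t_prev=0.0, t_next=2.0, t_tgt=0.5)
```

The same interpolation applies unchanged to the track-based case in panel (b): the two endpoints are simply the temporally closest frames in which the track is visible, which need not be adjacent frames.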