Pseudo-Generalized Dynamic View Synthesis from a Video
Xiaoming Zhao, Alex Colburn, Fangchang Ma, Miguel Angel Bautista, Joshua M. Susskind, Alexander G. Schwing
TL;DR
The paper probes whether generalized dynamic novel-view synthesis from monocular videos is achievable and proposes a pseudo-generalized framework that avoids scene-specific appearance fitting but relies on consistent depth estimates. Static content is rendered via an adapted generalizable NeRF Transformer with masked attention to handle dynamic occlusions, while dynamic content is reconstructed from depth- and time-based priors, including track-based temporal information. Experiments on NVIDIA Dynamic Scenes and DyCheck demonstrate competitive performance against some scene-specific baselines, highlighting the value of depth priors for generalization and outlining limitations due to depth/tracking quality. The work emphasizes the need for advances in monocular depth estimation and temporal aggregation to move closer to fully generalized dynamic NVS from monocular videos.
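To make the masked-attention idea concrete, here is a minimal NumPy sketch of attention that ignores source samples flagged as dynamic. It is an illustration under stated assumptions, not the paper's implementation: the function name `masked_attention`, the single-head dot-product form, and the boolean `static_mask` convention (True means the source sample shows static content) are all hypothetical.

```python
import numpy as np

def masked_attention(query, keys, values, static_mask):
    """Single-head dot-product attention restricted to static source samples.

    query:       (d,)   feature of the target ray sample
    keys:        (n, d) features of the n source-view samples
    values:      (n, c) payload (e.g. color features) of the source samples
    static_mask: (n,)   bool, True where the source sample is static (assumed convention)
    """
    if not static_mask.any():
        # Nothing static to aggregate from; return zeros as a placeholder.
        return np.zeros(values.shape[1])
    scores = keys @ query / np.sqrt(query.shape[0])
    # Dynamic samples get -inf so they receive exactly zero weight.
    scores = np.where(static_mask, scores, -np.inf)
    # Subtract the max over valid entries for numerical stability.
    scores = scores - scores[static_mask].max()
    weights = np.exp(scores)
    weights /= weights.sum()
    return weights @ values
```

With this convention, a source pixel covered by a moving object contributes nothing to the static rendering, however similar its feature is to the query.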
Abstract
Rendering scenes observed in a monocular video from novel viewpoints is a challenging problem. For static scenes, the community has studied both scene-specific optimization techniques, which optimize on every test scene, and generalized techniques, which only run a deep net forward pass on a test scene. In contrast, for dynamic scenes, scene-specific optimization techniques exist, but, to the best of our knowledge, there is currently no generalized method for dynamic novel view synthesis from a given monocular video. To answer whether generalized dynamic novel view synthesis from monocular videos is possible today, we establish an analysis framework based on existing techniques and work toward the generalized approach. We find that a pseudo-generalized process without scene-specific appearance optimization is possible, but geometrically and temporally consistent depth estimates are needed. Despite using no scene-specific appearance optimization, the pseudo-generalized approach improves upon some scene-specific methods.
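Why consistent depth matters can be seen in the standard pinhole reprojection that a depth-based pipeline relies on: a pixel with known depth is lifted to 3D and projected into the novel view, so any depth error directly shifts where the content lands. The sketch below shows that generic operation only. The function name `reproject_with_depth`, the z-depth convention, and the 4x4 source-to-target transform are assumptions for illustration, not details from the paper.

```python
import numpy as np

def reproject_with_depth(pixels, depth, K_src, K_tgt, T_tgt_from_src):
    """Lift source pixels to 3D with depth, then project into a target camera.

    pixels:         (n, 2) pixel coordinates (u, v) in the source image
    depth:          (n,)   z-depth of each pixel in the source camera frame
    K_src, K_tgt:   (3, 3) pinhole intrinsics
    T_tgt_from_src: (4, 4) rigid transform from source to target camera frame
    Returns (n, 2) target pixel coordinates and (n,) target z-depth.
    """
    ones = np.ones((pixels.shape[0], 1))
    homog = np.hstack([pixels, ones])                 # homogeneous pixels, (n, 3)
    rays = homog @ np.linalg.inv(K_src).T             # back-projected rays with z = 1
    pts_src = rays * depth[:, None]                   # 3D points in the source frame
    R, t = T_tgt_from_src[:3, :3], T_tgt_from_src[:3, 3]
    pts_tgt = pts_src @ R.T + t                       # 3D points in the target frame
    proj = pts_tgt @ K_tgt.T                          # perspective projection
    z = proj[:, 2]
    return proj[:, :2] / z[:, None], z
```

For example, with focal length 100 and principal point (50, 50), a pixel at the image center with depth 2 moves 25 pixels to the right when the point is shifted by 0.5 along x in the target frame. If the depth were wrongly estimated as 4, the shift would be only 12.5 pixels.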
