D-NPC: Dynamic Neural Point Clouds for Non-Rigid View Synthesis from Monocular Video

Moritz Kappel, Florian Hahlbohm, Timon Scholz, Susana Castillo, Christian Theobalt, Martin Eisemann, Vladislav Golyanik, Marcus Magnor

TL;DR

This work tackles non-rigid view synthesis from monocular video by introducing Dynamic Neural Point Clouds (D-NPC), which model a scene as a time-conditioned implicit point distribution with separate static and dynamic hash-encoded feature grids. By sampling explicit points from a temporal probability field and rendering with a fast differentiable rasterizer plus a lightweight neural renderer, D-NPC achieves interactive frame rates while maintaining competitive image quality. The method leverages monocular priors such as depth and foreground segmentation to initialize and guide optimization, enabling rapid convergence and robust dynamics handling. Overall, D-NPC delivers a practical monocular solution for dynamic view synthesis with strong perceptual results and real-time capabilities, offering a compelling direction for real-world applications like mobile capturing and interactive visualization.

Abstract

Dynamic reconstruction and spatiotemporal novel-view synthesis of non-rigidly deforming scenes have recently gained increased attention. While existing work achieves impressive quality and performance on multi-view or teleporting camera setups, most methods fail to efficiently and faithfully recover motion and appearance from casual monocular captures. This paper contributes to the field by introducing a new method for dynamic novel view synthesis from monocular video, such as casual smartphone captures. Our approach represents the scene as a $\textit{dynamic neural point cloud}$, an implicit time-conditioned point distribution that encodes local geometry and appearance in separate hash-encoded neural feature grids for static and dynamic regions. By sampling a discrete point cloud from our model, we can efficiently render high-quality novel views using a fast differentiable rasterizer and neural rendering network. Similar to recent work, we leverage advances in neural scene analysis by incorporating data-driven priors like monocular depth estimation and object segmentation to resolve motion and depth ambiguities originating from the monocular captures. In addition to guiding the optimization process, we show that these priors can be exploited to explicitly initialize our scene representation to drastically improve optimization speed and final image quality. As evidenced by our experimental evaluation, our dynamic point cloud model not only enables fast optimization and real-time frame rates for interactive applications, but also achieves competitive image quality on monocular benchmark sequences. Our code and data are available online: https://moritzkappel.github.io/projects/dnpc/.
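
To make the idea of "sampling a discrete point cloud" from a time-conditioned point distribution more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the distribution is discretized into a coarse voxel grid with one weight table for static content and one per-timestamp table for dynamic content. The function name `sample_point_cloud`, the dictionary layout, and the uniform jitter inside each voxel are all hypothetical choices made for illustration. Feature lookup, rasterization, and neural rendering are omitted.

```python
import random


def sample_point_cloud(static_prob, dynamic_prob_t, num_points, grid_res, seed=0):
    """Draw explicit 3D points from a discretized point-position distribution.

    Hypothetical layout (an assumption, not taken from the paper):
    static_prob and dynamic_prob_t map a voxel index (i, j, k) to a
    non-negative weight; dynamic_prob_t is the slice for one timestamp.
    Returns a list of ((x, y, z), is_dynamic) with coordinates in the unit cube.
    """
    rng = random.Random(seed)
    # Pool static and dynamic voxels so both compete for the same point budget.
    cells = [(v, False) for v in static_prob] + [(v, True) for v in dynamic_prob_t]
    weights = list(static_prob.values()) + list(dynamic_prob_t.values())
    # Pick voxels with probability proportional to their weight.
    chosen = rng.choices(cells, weights=weights, k=num_points)
    points = []
    for (i, j, k), is_dynamic in chosen:
        # Jitter uniformly inside the selected voxel to get a continuous position.
        x = (i + rng.random()) / grid_res
        y = (j + rng.random()) / grid_res
        z = (k + rng.random()) / grid_res
        points.append(((x, y, z), is_dynamic))
    return points


# Toy example on a 2x2x2 grid: one occupied static voxel, one empty static
# voxel, and one dynamic voxel that is three times as likely at this timestamp.
static = {(0, 0, 0): 1.0, (1, 1, 1): 0.0}
dynamic_t0 = {(1, 0, 1): 3.0}
cloud = sample_point_cloud(static, dynamic_t0, num_points=1000, grid_res=2)
```

In this sketch, each sampled point carries a static/dynamic flag, which mirrors the abstract's split into separate feature grids: a renderer could use the flag to decide which grid to query for that point's features.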

Paper Structure (25 sections, 7 equations, 7 figures, 12 tables)

This paper contains 25 sections, 7 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Method overview. Given a monocular RGB video and priors extracted from out-of-the-box estimators, we initialize and optimize a dynamic implicit neural point cloud, consisting of a spatiotemporal point position distribution and two feature grids for static and dynamic scene content. By sampling an explicit point cloud for a discrete timestamp, our model can synthesize novel views, including foreground/background separation, using a differentiable rasterizer and neural renderer.
  • Figure 2: Qualitative comparisons on the NVIDIA dataset. Our method preserves fine details and clean foreground-background transitions.
  • Figure 3: Comparison on five challenging scenes of the iPhone dataset. Our method better preserves the object shapes and fine details despite complex deformations. Methods indicated by $\dagger$ are the highest quality versions provided by Gao et al. (2022). The areas highlighted in red indicate lack of co-visibility between training and validation.
  • Figure 4: Novel views synthesized using complex in-the-wild sequences of the DAVIS dataset. On the right, we show the corresponding visualizations of our implicit foreground/background separation.
  • Figure 5: Failure Cases. Top: Cropped foreground not contained in the training camera frustum. Bottom: Object duplications due to depth misalignment.
  • ...and 2 more figures