Modeling Ambient Scene Dynamics for Free-view Synthesis

Meng-Li Shih, Jia-Bin Huang, Changil Kim, Rajvi Shah, Johannes Kopf, Chen Gao

TL;DR

This work tackles dynamic ambient free-view synthesis from monocular captures by extending 3D Gaussian Splatting to model time-varying scene components. It introduces per-Gaussian motion trajectories encoded with a DCT-based basis, predicted by an MLP, and stabilized by rigidity and depth regularization to handle unbounded scenes. The method employs a three-stage pipeline with depth-guided static reconstruction and memory-efficient, multi-pass rendering to render high-fidelity novel views and unseen motions, validated on a new plant/ambient-dynamics forest dataset. The contributions include a monocular dynamic ambient dataset, motion-editing capabilities, and substantial improvements over baselines in both qualitative and quantitative metrics, enabling realistic immersive viewpoints for unbounded outdoor scenes.

Abstract

We introduce a novel method for dynamic free-view synthesis of ambient scenes from a monocular capture, bringing an immersive quality to the viewing experience. Our method builds upon recent advancements in 3D Gaussian Splatting (3DGS) that can faithfully reconstruct complex static scenes. Previous attempts to extend 3DGS to represent dynamics have been confined to bounded scenes or require multi-camera captures, and often fail to generalize to unseen motions, limiting their practical application. Our approach overcomes these constraints by leveraging the periodicity of ambient motions to learn the motion trajectory model, coupled with careful regularization. We also propose important practical strategies to improve the visual quality of the baseline 3DGS static reconstructions and to improve memory efficiency, which is critical for GPU-memory-intensive learning. We demonstrate high-quality photorealistic novel view synthesis of several ambient natural scenes with intricate textures and fine structural elements.
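
To make the DCT-based trajectory idea from the TL;DR concrete, here is a minimal sketch of how a per-Gaussian displacement could be evaluated from a small set of cosine coefficients. The exact basis and normalization used in the paper are not stated in this summary, so the DCT-II style form below, $\Delta x(t) = \sqrt{2/T}\,\sum_{k=1}^{K} \phi_k \cos\!\left(\tfrac{\pi}{2T}(2t+1)k\right)$, is an assumption. The function names (`dct_basis`, `displacement`, `position_at`) are hypothetical, the coefficients are set by hand rather than predicted by a network, and only translation is covered (rotation is omitted).

```python
import math


def dct_basis(t, num_frames, num_coeffs):
    """Evaluate K cosine basis functions at time step t (may be fractional).

    Assumed form (DCT-II style), not confirmed by the source:
        b_k(t) = sqrt(2 / T) * cos(pi / (2 T) * (2 t + 1) * k),  k = 1..K
    """
    scale = math.sqrt(2.0 / num_frames)
    return [
        scale * math.cos(math.pi / (2.0 * num_frames) * (2.0 * t + 1.0) * k)
        for k in range(1, num_coeffs + 1)
    ]


def displacement(coeffs_xyz, t, num_frames):
    """Time-varying offset of one Gaussian center.

    coeffs_xyz holds three lists (x, y, z), each with K coefficients.
    In the method these would be predicted per point; here they are inputs.
    """
    basis = dct_basis(t, num_frames, len(coeffs_xyz[0]))
    return tuple(sum(c * b for c, b in zip(axis, basis)) for axis in coeffs_xyz)


def position_at(center, coeffs_xyz, t, num_frames):
    """Static center plus the DCT-encoded displacement at time t."""
    dx, dy, dz = displacement(coeffs_xyz, t, num_frames)
    return (center[0] + dx, center[1] + dy, center[2] + dz)


if __name__ == "__main__":
    # Hypothetical example: one leaf-like point swaying mostly along x.
    center = (0.0, 1.0, 2.0)
    coeffs = [[0.05, 0.01], [0.0, 0.0], [0.002, 0.0]]
    for t in range(4):
        print(t, position_at(center, coeffs, t, num_frames=4))
```

Because the trajectory is a smooth function of `t`, it can be evaluated at fractional time steps, and scaling the coefficients gives a simple handle for amplifying or damping the motion.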

Paper Structure (26 sections, 7 equations, 11 figures, 2 tables)

This paper contains 26 sections, 7 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Limitations of existing methods. Here we highlight the limitations of state-of-the-art dynamic radiance fields in addressing the ambient dynamics in an unbounded scene. RoDynRF suffers from severe blurriness due to its voxel-grid representation. 4D-GS can recover some spatial details for content close to the cameras, but struggles with ambient motion, resulting in unstable foreground motion and inaccurate background motion. Please refer to the supplementary video for comparison.
  • Figure 2: Method Overview. Our method comprises three stages: 1) pre-processing, 2) static scene reconstruction, and 3) dynamic scene reconstruction. In the pre-processing stage, we extract the rendered depth map using Instant-NGP (Müller et al., 2022). The rendered depth from the radiance field provides essential depth regularization for unbounded scenes, where 3D point cloud recovery is poor in distant regions. In the static scene reconstruction stage, we use a photometric loss from the captured photos and a depth loss from the reconstructed radiance field (from Instant-NGP). This stage produces a high-quality static 3D Gaussian representation of the scene. However, the resulting representation does not model time-varying components such as ambient scene motion, so dynamic regions are inevitably blurry (due to inconsistent photometric losses across frames). In the dynamic scene reconstruction stage, we introduce temporal parameters to explicitly model the dynamics of each individual 3D Gaussian. We do so by using a triplane-based representation to predict the DCT coefficients for each point (the center position of each 3D Gaussian). Using the predicted coefficients, we can recover the time-varying translation and rotation at any time step $t$. We supervise the motion representation using a photometric loss and rigidity regularization. The resulting representation allows us to model the time-varying 3D Gaussians and thus render high-quality frames from novel views and times.
  • Figure 3: The importance of depth regularization. The quality of 3D-GS depends heavily on accurate 3D point cloud initialization. In unbounded scenes, however, the geometry of scene elements far away from the camera cannot be reliably reconstructed with structure-from-motion algorithms (due to small motion parallax). Consequently, 3D-GS tends to predict incorrect geometry in the background and render blurry images due to the lack of initial Gaussians (top). We address this challenge by applying depth regularization. With the regularization, we observe more accurate and detailed appearance in our rendering (bottom).
  • Figure 4: Effect of different scene normalization strategies. (a) Normalizing the scene based on the range of the 3D point cloud often results in an inefficient use of representational power (because the scene scale can be very large). This typically leads to a blurry foreground and an incorrectly rendered background. (b) When normalizing the scene using the range of camera poses and applying $\infty$-norm contraction, the foreground becomes sharper. However, the background remains blurry due to inaccurately predicted motion. (c) We propose to normalize the scene using the range of camera poses and map points outside this range to the boundary. Our results show that this achieves higher-quality synthesis in both the foreground and background regions.
  • Figure 5: Effect of time-varying parameters. In stage 2, the absence of time-varying parameters leads to blur and ghosting artifacts. In contrast, stage 3's joint optimization of time-varying and time-independent parameters allows for accurate reconstruction of ambient motion and 3D geometry.
  • ...and 6 more figures
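
The two camera-range normalization strategies contrasted in the Figure 4 caption can be sketched as follows. This is an illustrative reading only: the summary gives no formulas, so the contraction below assumes the commonly used form $x \mapsto (2 - 1/\|x\|_\infty)\,x/\|x\|_\infty$ for $\|x\|_\infty > 1$, and "map points outside this range to the boundary" is interpreted as a per-axis clamp to $[-1, 1]$. All function names are hypothetical.

```python
def normalize(point, cam_min, cam_max):
    """Affinely map the camera-pose bounding box to [-1, 1] on each axis."""
    return tuple(
        2.0 * (p - lo) / (hi - lo) - 1.0
        for p, lo, hi in zip(point, cam_min, cam_max)
    )


def contract_inf(point):
    """Strategy (b), assumed form: infinity-norm contraction.

    Points inside the unit cube are unchanged; points outside are squashed
    into the shell between the unit cube and the cube of half-width 2.
    """
    norm = max(abs(c) for c in point)
    if norm <= 1.0:
        return tuple(point)
    scale = (2.0 - 1.0 / norm) / norm
    return tuple(c * scale for c in point)


def clamp_to_boundary(point):
    """Strategy (c), one possible reading: clamp each axis to [-1, 1]."""
    return tuple(max(-1.0, min(1.0, c)) for c in point)


if __name__ == "__main__":
    cam_min, cam_max = (-1.0, -1.0, -1.0), (1.0, 1.0, 1.0)
    far_point = normalize((5.0, 0.0, 0.0), cam_min, cam_max)
    print(contract_inf(far_point))       # stays distinct from nearer points
    print(clamp_to_boundary(far_point))  # collapses onto the cube face
```

Under these assumptions, contraction keeps distant points distinguishable from one another, whereas clamping places every point beyond the camera range on the boundary of the normalized cube.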