MoDGS: Dynamic Gaussian Splatting from Casually-captured Monocular Videos with Depth Priors

Qingming Liu, Yuan Liu, Jiepeng Wang, Xianqiang Lyv, Peng Wang, Wenping Wang, Junhui Hou

TL;DR

MoDGS tackles dynamic novel-view synthesis from casually captured monocular videos by representing a scene with Gaussians in a canonical space and a time-conditioned deformation field, then rendering via splatting. It introduces a 3D-aware initialization to robustly bootstrap the deformation and Gaussian placement, and an ordinal depth loss to exploit depth orders from single-view priors while mitigating scale inconsistencies. The approach outperforms state-of-the-art baselines on multiple datasets, including in-the-wild video sequences, demonstrating strong robustness to casual capture conditions. The work highlights depth-guided 3D priors as a practical enabler for dynamic scene reconstruction when multiview cues are weak or unavailable.
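
The ordinal depth idea can be illustrated with a minimal sketch. The version below is an assumption, not the paper's exact formulation: it takes sampled pixel pairs, squashes the rendered depth difference with tanh, and compares the result to the depth order given by the single-view prior. The names `depth_order` and `ordinal_depth_loss` and the sharpness parameter `alpha` are hypothetical.

```python
import math


def depth_order(d_a, d_b):
    """Return +1 if d_a > d_b, -1 if d_a < d_b, and 0 if they are equal."""
    return (d_a > d_b) - (d_a < d_b)


def ordinal_depth_loss(rendered, prior, pairs, alpha=10.0):
    """Mean disagreement between rendered and prior depth orders.

    rendered, prior: flat lists of depth values indexed by pixel.
    pairs: list of (i, j) pixel-index pairs to compare.
    alpha: hypothetical sharpness of the soft sign (assumed value).
    """
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        # Soft sign of the rendered depth difference, in (-1, 1).
        soft_order = math.tanh(alpha * (rendered[i] - rendered[j]))
        # Hard order from the prior; its scale and offset never enter the loss.
        total += abs(soft_order - depth_order(prior[i], prior[j]))
    return total / len(pairs)
```

Because only the order of the prior depths is used, any monotonically increasing distortion of the prior (a change of scale, an offset, or a nonlinear warp) leaves this loss unchanged, which is the property that sidesteps scale inconsistency across frames.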

Abstract

In this paper, we propose MoDGS, a new pipeline to render novel views of dynamic scenes from a casually captured monocular video. Previous monocular dynamic NeRF or Gaussian Splatting methods strongly rely on the rapid movement of input cameras to construct multiview consistency but struggle to reconstruct dynamic scenes on casually captured input videos whose cameras are either static or move slowly. To address this challenging task, MoDGS adopts recent single-view depth estimation methods to guide the learning of the dynamic scene. Then, a novel 3D-aware initialization method is proposed to learn a reasonable deformation field and a new robust depth loss is proposed to guide the learning of dynamic scene geometry. Comprehensive experiments demonstrate that MoDGS is able to render high-quality novel view images of dynamic scenes from just a casually captured monocular video, which outperforms state-of-the-art methods by a significant margin. The code will be publicly available.

Paper Structure (54 sections, 6 equations, 17 figures, 13 tables)

Figures (17)

  • Figure 1: Given a casually captured monocular video of a dynamic scene, MoDGS is able to synthesize high-quality novel-view images of this scene. In the middle column, the baseline method Deformable-GS yang2023deformable fails to correctly reconstruct the 3D dynamic scene from this static monocular video. The white regions in the cyan bounding boxes are not visible in the input video (red bounding boxes), so these unseen regions contain some artifacts. In the rightmost column, the input estimated monocular depth is inconsistent (red bounding boxes); however, our proposed ordinal depth loss effectively ensures more consistent depth outputs. This loss makes learning the underlying geometry more accurate and reliable.
  • Figure 2: Overview. Given a casually captured monocular video of a dynamic scene, MoDGS represents the dynamic scene with a set of Gaussians in a canonical space and a deformation field represented by an MLP $\mathcal{T}$. To render an image at a specific timestamp $t$, we deform all the Gaussians by $\mathcal{T}_t$ and then use the splatting technique to render images and depth maps. When training MoDGS, we use the single-view depth estimator GeoWizard fu2024geowizard to estimate depth maps, then compute a rendering loss and an ordinal depth loss.
  • Figure 3: (a) Initialization of the deformation field. We first lift the depth maps and a 2D flow to a 3D flow and train the deformation field for initialization. (b) Initialization of Gaussians in the canonical space. We use the initialized deformation field to deform all the depth points to the canonical space and downsample these depth points to initialize Gaussians.
  • Figure 4: We show the estimated single-view depth maps at two different timestamps $D_{t_i}$ and $D_{t_j}$ after normalization to the same scale. Since the single-view depth estimator is not accurate enough, the depth maps are not linearly related, so the scale normalization does not perfectly align them. However, the order of depth values on three corresponding pixels is stable across these two depth maps, which motivates us to propose an ordinal depth loss for supervision.
  • Figure 5: Qualitative comparison of novel-view renderings on the DyNeRF li2022neural and Nvidia yoon2020novel datasets. We compare MoDGS with SC-GS huang2024sc-gs, Deformable-GS yang2023deformable, and HexPlane cao2023hexplane.
  • ...and 12 more figures
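
The lifting step described in the Figure 3(a) caption (depth maps plus a 2D flow giving a 3D flow) can be sketched as follows. This is an illustrative reading under stated assumptions, not the authors' implementation: it assumes a pinhole camera with hypothetical intrinsics `(fx, fy, cx, cy)`, a camera that does not move between the two frames, depth maps already brought to a common scale, and nearest-pixel lookup of the target depth. The function names `backproject` and `lift_flow_to_3d` are hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with the given depth to a 3D point in the camera frame."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def lift_flow_to_3d(depth_i, depth_j, flow, intrinsics):
    """Combine two depth maps and a 2D flow field into per-pixel 3D flow.

    depth_i, depth_j: depth maps at times t_i and t_j, as lists of rows.
    flow: flow[v][u] = (du, dv), the 2D displacement from frame i to frame j.
    intrinsics: (fx, fy, cx, cy) of an assumed pinhole camera.
    Returns a dict mapping (u, v) in frame i to its 3D displacement.
    """
    fx, fy, cx, cy = intrinsics
    height, width = len(depth_i), len(depth_i[0])
    flows_3d = {}
    for v in range(height):
        for u in range(width):
            du, dv = flow[v][u]
            # Nearest pixel in frame j that the 2D flow points to.
            u2, v2 = round(u + du), round(v + dv)
            if not (0 <= u2 < width and 0 <= v2 < height):
                continue  # the flow leaves the image; no correspondence
            p_i = backproject(u, v, depth_i[v][u], fx, fy, cx, cy)
            p_j = backproject(u2, v2, depth_j[v2][u2], fx, fy, cx, cy)
            flows_3d[(u, v)] = tuple(b - a for a, b in zip(p_i, p_j))
    return flows_3d
```

Under these assumptions, the resulting 3D displacements are the kind of target a deformation field could be fitted to during initialization. With a moving camera, both points would first have to be transformed into a shared world frame using the camera poses.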