MoDGS: Dynamic Gaussian Splatting from Casually-captured Monocular Videos with Depth Priors
Qingming Liu, Yuan Liu, Jiepeng Wang, Xianqiang Lyv, Peng Wang, Wenping Wang, Junhui Hou
TL;DR
MoDGS tackles dynamic novel-view synthesis from casually captured monocular videos by representing a scene with Gaussians in a canonical space and a time-conditioned deformation field, then rendering via splatting. It introduces a 3D-aware initialization to robustly bootstrap the deformation and Gaussian placement, and an ordinal depth loss to exploit depth orders from single-view priors while mitigating scale inconsistencies. The approach outperforms state-of-the-art baselines on multiple datasets, including in-the-wild video sequences, demonstrating strong robustness to casual capture conditions. The work highlights depth-guided 3D priors as a practical enabler for dynamic scene reconstruction when multiview cues are weak or unavailable.
Abstract
In this paper, we propose MoDGS, a new pipeline to render novel views of dynamic scenes from a casually captured monocular video. Previous monocular dynamic NeRF or Gaussian Splatting methods strongly rely on the rapid movement of input cameras to construct multiview consistency but struggle to reconstruct dynamic scenes on casually captured input videos whose cameras are either static or move slowly. To address this challenging task, MoDGS adopts recent single-view depth estimation methods to guide the learning of the dynamic scene. Then, a novel 3D-aware initialization method is proposed to learn a reasonable deformation field and a new robust depth loss is proposed to guide the learning of dynamic scene geometry. Comprehensive experiments demonstrate that MoDGS is able to render high-quality novel view images of dynamic scenes from just a casually captured monocular video, which outperforms state-of-the-art methods by a significant margin. The code will be publicly available.
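To illustrate why a depth loss built on depth order can tolerate the scale inconsistencies of single-view depth estimates, the following is a minimal sketch of a generic pairwise ordinal depth loss. It is not taken from the paper: the function name `ordinal_depth_loss`, the random pair sampling, the `tanh` relaxation, and the parameters `num_pairs` and `alpha` are all assumptions made for illustration, and the formulation used in MoDGS may differ.

```python
import numpy as np


def ordinal_depth_loss(rendered, prior, num_pairs=1024, alpha=10.0, seed=0):
    """Hypothetical pairwise ordinal depth loss (illustrative, not the paper's).

    rendered: depth map rendered from the scene representation, shape (H, W)
    prior:    depth map from a single-view estimator, shape (H, W)
    Only the ORDER of prior depths is used, so the loss does not change
    under a positive scaling or a shift of the prior.
    """
    rng = np.random.default_rng(seed)
    r = np.asarray(rendered, dtype=np.float64).ravel()
    p = np.asarray(prior, dtype=np.float64).ravel()

    # Sample random pixel pairs (i, j).
    i = rng.integers(0, r.size, size=num_pairs)
    j = rng.integers(0, r.size, size=num_pairs)

    # Target order from the prior: +1, -1, or 0 when the two depths are equal.
    order = np.sign(p[i] - p[j])

    # Soft order of the rendered depths, squashed into (-1, 1).
    soft_order = np.tanh(alpha * (r[i] - r[j]))

    # Penalize disagreement between the rendered order and the prior order.
    return float(np.mean(np.abs(soft_order - order)))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    prior = rng.random((32, 32))
    consistent = 3.0 * prior + 5.0   # same ordering, different scale and shift
    flipped = -3.0 * prior + 5.0     # reversed ordering
    print("consistent:", ordinal_depth_loss(consistent, prior))
    print("flipped:   ", ordinal_depth_loss(flipped, prior))
```

In this sketch, a rendered depth map that preserves the prior's ordering yields a small loss even though its scale and offset differ from the prior, while a map with reversed ordering is penalized heavily. A direct L1 or L2 comparison of depth values would not have this property.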
