RoDyGS: Robust Dynamic Gaussian Splatting for Casual Videos

Yoonwoo Jeong, Junmyeong Lee, Hoseung Choi, Minsu Cho

TL;DR

RoDyGS tackles dynamic view synthesis from casual videos by explicitly separating static backgrounds and dynamic objects and by enforcing physically plausible motion and geometry through novel regularizers. It builds on 3D Gaussian Splatting with a dynamic extension via learnable motion bases, guided by geometric priors (MASt3R) and motion masks (TAM), and optimizes camera poses jointly with Gaussians. The authors introduce Kubric-MRig, a challenging benchmark that combines wide viewpoints, large camera/object motion, and concurrent multi-view captures to evaluate pose estimation and rendering quality. Empirical results show that RoDyGS outperforms pose-free dynamic neural fields and reaches competitive rendering quality with pose-free static fields, while ablations confirm the effectiveness of the geometry and motion regularizers. The work advances practical dynamic scene reconstruction from casual videos and provides a robust framework for future pose-free dynamic NVS research.

Abstract

Dynamic view synthesis (DVS) has advanced remarkably in recent years, achieving high-fidelity rendering while reducing computational costs. Despite the progress, optimizing dynamic neural fields from casual videos remains challenging, as these videos do not provide direct 3D information, such as camera trajectories or the underlying scene geometry. In this work, we present RoDyGS, an optimization pipeline for dynamic Gaussian Splatting from casual videos. It effectively learns motion and underlying geometry of scenes by separating dynamic and static primitives, and ensures that the learned motion and geometry are physically plausible by incorporating motion and geometric regularization terms. We also introduce a comprehensive benchmark, Kubric-MRig, that provides extensive camera and object motion along with simultaneous multi-view captures, features that are absent in previous benchmarks. Experimental results demonstrate that the proposed method significantly outperforms previous pose-free dynamic neural fields and achieves competitive rendering quality compared to existing pose-free static neural fields. The code and data are publicly available at https://rodygs.github.io/.

Paper Structure

This paper contains 49 sections, 16 equations, 12 figures, and 17 tables.

Figures (12)

  • Figure 1: Robust Dynamic Gaussian Splatting (RoDyGS). RoDyGS achieves high-fidelity rendering of novel viewpoints from casual videos, significantly outperforming RoDynRF, which struggles with blurriness during substantial camera and object movement.
  • Figure 2: RoDyGS Pipeline Overview. Starting with a casually captured video input, RoDyGS extracts camera poses and depths using MASt3R [leroy2024grounding], while motion masks are derived from TAM [yang2023track]. It then separates static and dynamic Gaussians, so that the stationary background and the moving objects are learned independently. The primary optimization objective, $L_{gs}$, includes a photometric loss and a Pearson depth loss, with depth guidance extracted from images using DepthAnything [depth_anything_v1]. Additionally, for dynamic Gaussians, Gaussian distance-preserving regularization ($\mathcal{L}_{tc}$) and surface smoothness regularization ($\mathcal{L}_{s}$) are applied. For the motion bases, continuous motion regularization ($\mathcal{L}_{mc}$) is employed.
  • Figure 3: Qualitative results on Kubric-MRig and iPhone. Our pipeline accurately reconstructs scene geometry, produces sharp renderings, and aligns object positions well. Without GT camera poses, RoDynRF struggles to learn the scene geometry, resulting in object positions that differ from the GT. Even with GT camera poses, RoDynRF produces blurry results.
  • Figure 4: Impact of regularization terms. Our regularization effectively enhances the perceptual quality of the rendering results, leading to sharper and more realistic renderings.
  • Figure 5: Comparison of motion masks between TAM [yang2023track] and RoDynRF [liu2023robust].
  • ...and 7 more figures
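
The Figure 2 caption names a Pearson depth loss as part of $L_{gs}$ but this page does not give its formula. The sketch below assumes one common formulation, one minus the Pearson correlation between the rendered depth and the monocular depth prior. The function name `pearson_depth_loss`, the `eps` term, and the flattened-list inputs are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Sequence


def pearson_depth_loss(
    rendered: Sequence[float],
    prior: Sequence[float],
    eps: float = 1e-8,  # assumed stabilizer against division by zero
) -> float:
    """Return 1 - Pearson correlation between two flattened depth maps.

    Assumption: this is a common form of a "Pearson depth loss";
    the source only names the loss and does not define it.
    """
    if len(rendered) != len(prior) or len(rendered) == 0:
        raise ValueError("depth maps must be non-empty and the same size")

    n = len(rendered)
    mean_r = sum(rendered) / n
    mean_p = sum(prior) / n

    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(rendered, prior))
    var_r = sum((r - mean_r) ** 2 for r in rendered)
    var_p = sum((p - mean_p) ** 2 for p in prior)

    corr = cov / (math.sqrt(var_r * var_p) + eps)
    return 1.0 - corr


if __name__ == "__main__":
    rendered_depth = [1.0, 2.0, 3.0, 4.0]
    # A prior that differs only by scale and shift is treated as a match.
    scaled_prior = [2.0 * d + 5.0 for d in rendered_depth]
    print(pearson_depth_loss(rendered_depth, scaled_prior))
```

Because Pearson correlation is unchanged by a positive scale and a shift, a loss of this form compares depth ordering and relative structure rather than absolute values, which suits monocular depth estimates whose scale is unknown.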