Gaussian Sequences with Multi-Scale Dynamics for 4D Reconstruction from Monocular Casual Videos

Can Li, Jie Gu, Jingmin Chen, Fangzhou Qiu, Lei Sun

TL;DR

This paper tackles the reconstruction of four-dimensional (4D) dynamic scenes from strictly monocular casual videos. It introduces Gaussian Sequences with MS-Dynamics, a structured, multi-scale motion representation that factorizes dynamics into object-level, sparse-primitive, and fine-grained components, and combines it with multi-modal priors from vision foundation models to constrain the solution space. The method reports substantial gains in dynamic novel-view synthesis on both benchmark and custom monocular datasets, outperforming state-of-the-art dynamic Gaussian and NeRF-based approaches. By enabling globally consistent, physically plausible dynamics under monocular supervision, the work advances robust 4D reconstruction for embodied AI and supports practical robot learning workflows.

Abstract

Understanding dynamic scenes from casual videos is critical for scalable robot learning, yet four-dimensional (4D) reconstruction under strictly monocular settings remains highly ill-posed. To address this challenge, our key insight is that real-world dynamics exhibit a multi-scale regularity from the object level down to the particle level. To this end, we design a multi-scale dynamics mechanism that factorizes complex motion fields. Within this formulation, we propose Gaussian sequences with multi-scale dynamics, a novel representation for dynamic 3D Gaussians derived through compositions of multi-level motion. This layered structure substantially alleviates reconstruction ambiguity and promotes physically plausible dynamics. We further incorporate multi-modal priors from vision foundation models to establish complementary supervision, constraining the solution space and improving reconstruction fidelity. Our approach enables accurate and globally consistent 4D reconstruction from monocular casual videos. Experiments on dynamic novel-view synthesis (NVS) with benchmark and real-world manipulation datasets demonstrate considerable improvements over existing methods.

Paper Structure

This paper contains 22 sections, 12 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: (a) From a casually captured monocular video with complex hand–object interactions (left), our MS-Dynamics models multi-scale dynamics to drive 4D Gaussians, producing a temporally coherent Gaussian sequence that synthesizes novel views with fine hand details (middle). Without MS-Dynamics, the 4D reconstruction is blurry and lacks structural fidelity (right). (b) An application to hand-held data collection (left): starting from a video demonstration of cup deformation (middle), our method generates novel-view demonstrations (right).
  • Figure 2: Overview of Gaussian Sequences with MS-Dynamics for 4D monocular reconstruction. The pipeline first preprocesses monocular videos to obtain depths, masks, point tracks, and camera parameters. Our MS-Dynamics performs multi-scale factorization from object ($L_1$), through sparse-primitive ($L_2$), to fine-grained level ($L_3$), capturing both global motion and local detailed deformation. Cross-frame Gaussian dynamics from canonical to target frame is modeled by shared weighted MS-Dynamics, constructing globally consistent Gaussian sequences. Both Gaussian sequences and MS-Dynamics are supervised by the aggregation of multi-modal signals (such as RGBs, depths, and tracks), which provides complementary cues for globally consistent optimization. The resulting Gaussian sequences enable high-quality dynamic NVS.
  • Figure 3: Experimental setup for our custom datasets.
  • Figure 4: Qualitative results of NVS on the iPhone datasets (Gao et al., 2022). Our method synthesizes finer details than the baselines. (GT: ground truth)
  • Figure 5: Qualitative results of dynamic NVS of our method on custom datasets containing hand- or gripper-object interactions. Our MS-Dynamics effectively represents these interaction dynamics and, even when trained under strictly monocular views, produces detailed novel views under considerable viewpoint changes.
  • ...and 2 more figures
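
The Figure 2 caption describes motion factorized into an object level ($L_1$), a sparse-primitive level ($L_2$), and a fine-grained level ($L_3$), composed to carry Gaussians from the canonical frame to a target frame. The excerpt does not give the paper's actual parameterization, so the following is only a minimal NumPy sketch under assumptions of our own: $L_1$ is one rigid transform per object, $L_2$ is a set of control nodes blended with linear blend skinning, $L_3$ is a per-Gaussian residual offset, and the levels are applied in the order $L_2$, $L_3$, $L_1$. The function name and all argument names are hypothetical.

```python
import numpy as np


def compose_multiscale_motion(x_canon, R_obj, t_obj, node_R, node_t, weights, residual):
    """Move canonical Gaussian centers to a target frame (illustrative sketch).

    All parameter choices here are assumptions, not the paper's formulation.
      x_canon:  (N, 3) canonical Gaussian centers
      R_obj:    (3, 3) and t_obj: (3,)      object-level rigid motion (L1)
      node_R:   (K, 3, 3) and node_t: (K, 3) sparse-primitive rigid motions (L2)
      weights:  (N, K) blend weights, each row summing to 1
      residual: (N, 3) fine-grained per-Gaussian offsets (L3)
    """
    # L2: apply every node's rigid motion to every point, shape (N, K, 3).
    per_node = np.einsum("kij,nj->nki", node_R, x_canon) + node_t[None]
    # Blend the K candidate positions with the per-point weights, shape (N, 3).
    x_l2 = np.einsum("nk,nki->ni", weights, per_node)
    # L3: add the fine-grained residual offset.
    x_l3 = x_l2 + residual
    # L1: apply the object-level rigid motion last.
    return x_l3 @ R_obj.T + t_obj


# Toy usage: two Gaussians, two control nodes, a 90-degree rotation about z.
x = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
node_R = np.stack([np.eye(3), np.eye(3)])
node_t = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
w = np.array([[1.0, 0.0], [0.5, 0.5]])
moved = compose_multiscale_motion(
    x, Rz, np.array([0.0, 0.0, 1.0]), node_R, node_t, w, np.zeros((2, 3))
)
```

Sharing the same node transforms and weights across all frames of a sequence is one plausible way to obtain the global consistency the caption mentions, since every target frame is then reached from a single canonical set of Gaussians; whether the paper does exactly this is not stated in the excerpt.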