
MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting

Haoran Zhou, Gim Hee Lee

Abstract

Realistic reconstruction of dynamic 4D scenes from monocular videos is essential for understanding the physical world. Despite recent progress in neural rendering, existing methods often struggle to recover accurate 3D geometry and temporally consistent motion in complex environments. To address these challenges, we propose MotionScale, a 4D Gaussian Splatting framework that scales efficiently to large scenes and extended sequences while maintaining high-fidelity structural and motion coherence. At the core of our approach is a scalable motion field parameterized by cluster-centric basis transformations that adaptively expand to capture diverse and evolving motion patterns. To ensure robust reconstruction over long durations, we introduce a progressive optimization strategy comprising two decoupled propagation stages: 1) A background extension stage that adapts to newly visible regions, refines camera poses, and explicitly models transient shadows; 2) A foreground propagation stage that enforces motion consistency through a specialized three-stage refinement process. Extensive experiments on challenging real-world benchmarks demonstrate that MotionScale significantly outperforms state-of-the-art methods in both reconstruction quality and temporal stability. Project page: https://hrzhou2.github.io/motion-scale-web/.
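The abstract describes a motion field parameterized by cluster-centric basis transformations, where each Gaussian's motion is expressed relative to a set of shared motion bases. The paper's exact parameterization is not given here, so the following is only a minimal sketch of the general idea under assumed conventions: `K` rigid basis transforms at a fixed time step, per-Gaussian soft blend weights over those bases, and a deformed center computed as the weighted sum of the rigidly transformed canonical center. All names (`rot_z`, `weights`, the shapes) are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 4    # number of motion bases (clusters) -- illustrative choice
N = 100  # number of Gaussians

# Per-Gaussian soft assignment weights over the K bases (rows sum to 1).
logits = rng.normal(size=(N, K))
weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

def rot_z(theta):
    """Rotation about the z-axis; stands in for a learned basis rotation."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# One rigid transform (R_k, t_k) per basis, evaluated at some time t.
R = np.stack([rot_z(0.1 * k) for k in range(K)])  # (K, 3, 3)
t = rng.normal(scale=0.05, size=(K, 3))           # (K, 3)

x = rng.normal(size=(N, 3))  # canonical Gaussian centers

# Blend the K rigidly transformed copies of each center with its weights:
#   x'_i = sum_k w_ik (R_k x_i + t_k)
transformed = np.einsum('kab,nb->nka', R, x) + t  # (N, K, 3)
x_deformed = np.einsum('nk,nka->na', weights, transformed)

print(x_deformed.shape)  # (100, 3)
```

In such a scheme, "adaptively expanding" the field would amount to growing `K` (e.g., splitting a cluster whose residual motion error is high) so new bases can absorb newly observed motion patterns; this sketch only shows the static blending step.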


Paper Structure

This paper contains 14 sections, 6 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Visualization of a reconstructed dynamic scene and extracted moving objects. Given a single monocular video as input, MotionScale reconstructs a 4D scene representation that effectively captures photorealistic appearance, accurate 3D geometry, and diverse human motion. Refer to the supplementary material for video results and additional examples.
  • Figure 2: Overview of MotionScale. Our method adopts a scalable motion field that progressively captures object motions through an adaptive control mechanism, enabling efficient splitting and refinement of motion components. For optimization, the background is updated through region sampling, camera refinement, and shadow handling, while the foreground propagation employs a three-stage refinement to propagate motion across long temporal windows for consistent 4D reconstruction.
  • Figure 3: Comparison of dynamic scene reconstruction results on challenging real-world videos from the DAVIS dataset. We compare MotionScale with Shape of Motion [wang2025shape] and GFlow [wang2025gflow] on several dynamic scenes containing complex object motions, occlusions, and large appearance variations. For the top rows, we show rendered results under two different viewpoints for each compared method.
  • Figure 4: Visual comparison of ablation results.