Motion4D: Learning 3D-Consistent Motion and Semantics for 4D Scene Understanding

Haoran Zhou, Gim Hee Lee

TL;DR

Motion4D addresses the lack of 3D consistency in 2D foundation models for dynamic scene understanding by integrating their priors into a unified 4D Gaussian Splatting representation. It uses an iterative two-part optimization (sequential and global) that refines motion and semantics through 3D confidence maps, adaptive resampling, and alternating updates of the semantic field and SAM2 prompts. The framework outperforms both 2D foundation models and existing 3D-based approaches on video object segmentation, 2D/3D point tracking, and novel view synthesis, with evaluations on DAVIS and the newly proposed DyCheck-VOS benchmark, supported by ablations.

Abstract

Recent advancements in foundation models for 2D vision have substantially improved the analysis of dynamic scenes from monocular videos. However, despite their strong generalization capabilities, these models often lack 3D consistency, a fundamental requirement for understanding scene geometry and motion, thereby causing severe spatial misalignment and temporal flickering in complex 3D environments. In this paper, we present Motion4D, a novel framework that addresses these challenges by integrating 2D priors from foundation models into a unified 4D Gaussian Splatting representation. Our method features a two-part iterative optimization framework: 1) Sequential optimization, which updates motion and semantic fields in consecutive stages to maintain local consistency, and 2) Global optimization, which jointly refines all attributes for long-term coherence. To enhance motion accuracy, we introduce a 3D confidence map that dynamically adjusts the motion priors, and an adaptive resampling process that inserts new Gaussians into under-represented regions based on per-pixel RGB and semantic errors. Furthermore, we enhance semantic coherence through an iterative refinement process that resolves semantic inconsistencies by alternately optimizing the semantic fields and updating prompts of SAM2. Extensive evaluations demonstrate that our Motion4D significantly outperforms both 2D foundation models and existing 3D-based approaches across diverse scene understanding tasks, including point-based tracking, video object segmentation, and novel view synthesis. Our code is available at https://hrzhou2.github.io/motion4d-web/.
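The adaptive resampling idea mentioned in the abstract (inserting new Gaussians where per-pixel RGB and semantic errors are high) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `sample_high_error_pixels` and `backproject`, the `sem_weight` parameter, the additive error combination, and the use of a depth map with pinhole intrinsics `K` to lift pixels to 3D are all assumptions made for illustration.

```python
import numpy as np

def sample_high_error_pixels(rgb_err, sem_err, num_samples, sem_weight=1.0, rng=None):
    """Sample (row, col) pixels with probability proportional to a combined error map.

    rgb_err, sem_err: non-negative (H, W) error maps (hypothetical inputs).
    sem_weight: assumed weighting between the two error terms.
    """
    rng = np.random.default_rng() if rng is None else rng
    err = rgb_err + sem_weight * sem_err
    total = err.sum()
    if total <= 0:
        # Nothing is under-represented, so no new points are proposed.
        return np.empty((0, 2), dtype=int)
    probs = err.ravel() / total
    # Sampling without replacement cannot draw more pixels than have non-zero error.
    n = min(num_samples, int(np.count_nonzero(probs)))
    idx = rng.choice(err.size, size=n, replace=False, p=probs)
    rows, cols = np.unravel_index(idx, err.shape)
    return np.stack([rows, cols], axis=1)

def backproject(pixels, depth, K):
    """Lift (row, col) pixels to camera-space 3D points with a pinhole model."""
    rows, cols = pixels[:, 0], pixels[:, 1]
    z = depth[rows, cols]
    x = (cols - K[0, 2]) / K[0, 0] * z
    y = (rows - K[1, 2]) / K[1, 1] * z
    return np.stack([x, y, z], axis=1)
```

In this sketch the returned 3D points would serve as candidate centers for new Gaussians; how Motion4D actually initializes and attaches the inserted Gaussians is not specified in the text above.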

Paper Structure

This paper contains 14 sections, 7 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Visualization of segmentation and tracking results. We compare Motion4D with state-of-the-art 2D foundation models, highlighting their lack of 3D consistency. As shown, existing 2D approaches often suffer from temporal flickering (green box) or spatial misalignment (red box).
  • Figure 2: Overview of Motion4D. Our Motion4D introduces an iterative refinement framework, consisting of (c) sequential optimization and global optimization stages. We develop (a) an iterative motion refinement module that uses 3D confidence maps and adaptive resampling to improve motion accuracy, and (b) an iterative semantic refinement module to refine the semantic field.
  • Figure 3: Illustration of (a) iterative semantic refinement and sequential optimization processes for (b) the semantic field and (c) the motion field. The proposed strategies effectively update 2D input priors to achieve consistent results across space and time.
  • Figure 4: Visualization of segmentation results on DyCheck-VOS (our proposed VOS benchmark). We provide our results of both rendered masks (Motion4D) and refined SAM2 masks (Motion4D*). As shown, the 2D predictions lack 3D consistency, which leads to misaligned spatial structures.
  • Figure 5: Visualization of 2D point tracking results. Motion4D maintains stable and accurate point trajectories even under severe occlusions and drastic object or camera motion.