From Tokens to Nodes: Semantic-Guided Motion Control for Dynamic 3D Gaussian Splatting

Jianing Chen, Zehao Li, Yujun Cai, Hao Jiang, Shuqin Gao, Honglong Zhao, Tianlu Mao, Yucheng Zhang

TL;DR

This work tackles monocular dynamic 3D reconstruction by addressing the mismatch between sparse control point allocation and motion complexity in existing methods. It introduces a motion-adaptive framework that leverages semantic and motion priors from vision foundation models to populate and compress a node-based deformation basis, followed by a spline-parameterized trajectory model that provides smooth, compact motion representation. A dual-quaternion-based blending scheme propagates node motion to Gaussians, while patch-to-node initialization and adaptive compression focus modeling capacity on dynamic regions. Extensive experiments on Hyper-NeRF and N3DV demonstrate superior reconstruction quality and efficiency, with ablations confirming the benefits of semantic-guided node initialization and spline-based motion, highlighting practical impact for real-time dynamic 3D rendering and editing.
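The dual-quaternion blending mentioned above can be illustrated with a minimal sketch. It assumes standard dual quaternion linear blending: each node carries a rigid transform (unit rotation quaternion plus translation), the transforms are averaged with per-Gaussian weights, and the result is renormalized. The function names, the `(w, x, y, z)` quaternion layout, and the weights are illustrative assumptions, not the paper's implementation.

```python
import math

def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def quat_conj(q):
    w, x, y, z = q
    return (w, -x, -y, -z)

def to_dual_quat(rot, trans):
    """Pack a unit rotation quaternion and a translation into (real, dual)."""
    t = (0.0, trans[0], trans[1], trans[2])
    dual = tuple(0.5 * c for c in quat_mul(t, rot))
    return rot, dual

def blend_dual_quats(dqs, weights):
    """Weighted sum of dual quaternions, normalized by the real part's norm."""
    ref = dqs[0][0]
    real = [0.0] * 4
    dual = [0.0] * 4
    for (r, d), w in zip(dqs, weights):
        # q and -q encode the same rotation; align signs before averaging.
        sign = 1.0 if sum(a * b for a, b in zip(r, ref)) >= 0.0 else -1.0
        for i in range(4):
            real[i] += sign * w * r[i]
            dual[i] += sign * w * d[i]
    norm = math.sqrt(sum(c * c for c in real))
    return tuple(c / norm for c in real), tuple(c / norm for c in dual)

def transform_point(dq, p):
    """Apply the rigid transform encoded by a dual quaternion to a 3D point."""
    real, dual = dq
    pq = (0.0, p[0], p[1], p[2])
    rotated = quat_mul(quat_mul(real, pq), quat_conj(real))
    # Translation is the vector part of 2 * dual * conj(real).
    t = quat_mul(dual, quat_conj(real))
    return tuple(rotated[i + 1] + 2.0 * t[i + 1] for i in range(3))
```

In this sketch, a Gaussian center influenced equally by two nodes that translate by (1, 0, 0) and (3, 0, 0) without rotating moves by (2, 0, 0). Blending in dual-quaternion space rather than averaging transformation matrices keeps the blended transform rigid, which is the usual motivation for the technique.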

Abstract

Dynamic 3D reconstruction from monocular videos remains difficult due to the ambiguity of inferring 3D motion from limited views and the computational demands of modeling temporally varying scenes. While recent sparse control methods alleviate computation by reducing millions of Gaussians to thousands of control points, they suffer from a critical limitation: they allocate points purely by geometry, leading to static redundancy and dynamic insufficiency. We propose a motion-adaptive framework that aligns control density with motion complexity. Leveraging semantic and motion priors from vision foundation models, we establish patch-token-node correspondences and apply motion-adaptive compression to concentrate control points in dynamic regions while suppressing redundancy in static backgrounds. Our approach adapts representational density through iterative voxelization and motion tendency scoring, directly addressing the fundamental mismatch between control point allocation and motion complexity. To capture temporal evolution, we introduce spline-based trajectory parameterization initialized by 2D tracklets, replacing MLP-based deformation fields to achieve smoother motion representation and more stable optimization. Extensive experiments demonstrate significant improvements in reconstruction quality and efficiency over existing state-of-the-art methods.
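To make the idea of a spline-parameterized node trajectory concrete, here is a minimal sketch. The abstract does not state which spline family is used, so this example assumes a uniform Catmull-Rom spline over a handful of 3D control points per node; `catmull_rom`, `eval_trajectory`, and the control-point layout are hypothetical names chosen for illustration.

```python
def catmull_rom(p0, p1, p2, p3, u):
    """Uniform Catmull-Rom segment between p1 (u=0) and p2 (u=1)."""
    u2, u3 = u * u, u * u * u
    return tuple(
        0.5 * (
            2.0 * b
            + (-a + c) * u
            + (2.0 * a - 5.0 * b + 4.0 * c - d) * u2
            + (-a + 3.0 * b - 3.0 * c + d) * u3
        )
        for a, b, c, d in zip(p0, p1, p2, p3)
    )

def eval_trajectory(ctrl, t):
    """Evaluate a node position at normalized time t in [0, 1].

    ctrl is a list of at least two 3D control points spaced uniformly in time.
    End segments reuse the first/last control point as the missing neighbor.
    """
    n_seg = len(ctrl) - 1
    x = min(max(t, 0.0), 1.0) * n_seg
    i = min(int(x), n_seg - 1)  # segment index
    u = x - i                   # local parameter inside the segment
    p0 = ctrl[max(i - 1, 0)]
    p1 = ctrl[i]
    p2 = ctrl[i + 1]
    p3 = ctrl[min(i + 2, len(ctrl) - 1)]
    return catmull_rom(p0, p1, p2, p3, u)

# Hypothetical control points, e.g. lifted from a 2D tracklet.
ctrl = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 1.0, 0.0), (3.0, 1.0, 0.0)]
mid = eval_trajectory(ctrl, 0.5)
```

Compared with querying an MLP at every timestamp, a trajectory like this is defined by a few control points per node and is smooth by construction, which matches the motivation given in the abstract. How the control points are initialized from tracklets is not specified in this section.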

Paper Structure

This paper contains 28 sections, 13 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The overview of our method. (A) Given a monocular video, we extract semantic and motion priors from pre-trained vision foundation models. (B) These priors guide motion-adaptive node initialization, yielding compact distributions aligned with dynamic regions. (C) The initialized nodes are assigned spline-parameterized trajectories to provide a motion basis. (D) Node motions are propagated to Gaussians through deformation, transforming the canonical representation. (E) The deformed model is rendered and optimized for consistent reconstruction.
  • Figure 2: Qualitative comparison on the Hyper-NeRF (vrig) dataset [hypernerf]. Compared with other SOTA methods, our method reconstructs finer details of the moving objects.
  • Figure 3: Qualitative comparison on the N3DV dataset [li2022neural].
  • Figure 4: Visualization of different node initialization methods on the Chicken scene of the Hyper-NeRF dataset [hypernerf].
  • Figure 5: Qualitative results of ablation.