Mango-GS: Enhancing Spatio-Temporal Consistency in Dynamic Scenes Reconstruction using Multi-Frame Node-Guided 4D Gaussian Splatting

Tingxuan Huang, Haowei Zhu, Jun-hai Yong, Hao Pan, Bin Wang

Abstract

Reconstructing dynamic 3D scenes with photorealistic detail and strong temporal coherence remains a significant challenge. Existing Gaussian splatting approaches for dynamic scene modeling often rely on per-frame optimization, which can overfit to instantaneous states instead of capturing underlying motion dynamics. To address this, we present Mango-GS, a multi-frame, node-guided framework for high-fidelity 4D reconstruction. Mango-GS leverages a temporal Transformer to model motion dependencies within a short window of frames, producing temporally consistent deformations. For efficiency, temporal modeling is confined to a sparse set of control nodes. Each node is represented by a decoupled canonical position and a latent code, providing a stable semantic anchor for motion propagation and preventing correspondence drift under large motion. Our framework is trained end-to-end, enhanced by an input masking strategy and two multi-frame losses to improve robustness. Extensive experiments demonstrate that Mango-GS achieves state-of-the-art reconstruction quality and real-time rendering speed, enabling high-fidelity reconstruction and interactive rendering of dynamic scenes.
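
To make the "decoupled canonical position and latent code" idea concrete, here is a minimal sketch of how a Gaussian might pick its control nodes in a joint position-and-code space. The overview figure caption describes a k-NN relationship based on both position and features; everything else here is an assumption for illustration, not the authors' implementation: the function names, the squared-distance form, the `code_weight` knob, and the idea that each Gaussian carries its own code.

```python
def joint_distance(g_pos, g_code, n_pos, n_code, code_weight=1.0):
    """Squared distance between a Gaussian and a node in a joint space.

    code_weight is a hypothetical knob balancing the spatial term
    against the latent-code term; 0.0 gives a position-only distance.
    """
    d_pos = sum((a - b) ** 2 for a, b in zip(g_pos, n_pos))
    d_code = sum((a - b) ** 2 for a, b in zip(g_code, n_code))
    return d_pos + code_weight * d_code


def nearest_nodes(g_pos, g_code, nodes, k=3, code_weight=1.0):
    """Return the indices of the k nodes closest to one Gaussian."""
    scored = [
        (joint_distance(g_pos, g_code, n_pos, n_code, code_weight), i)
        for i, (n_pos, n_code) in enumerate(nodes)
    ]
    scored.sort()
    return [i for _, i in scored[:k]]


# Toy data: each node is (canonical position, latent code).
nodes = [
    ((0.0, 0.0, 0.0), (1.0, 0.0)),  # node 0: same code as the Gaussian
    ((0.1, 0.0, 0.0), (0.0, 1.0)),  # node 1: spatially closest, different code
    ((2.0, 0.0, 0.0), (1.0, 0.0)),  # node 2: same code, far away
]
gaussian_pos = (0.08, 0.0, 0.0)
gaussian_code = (1.0, 0.0)

# Position-only matching picks the spatial neighbor (node 1).
position_only = nearest_nodes(gaussian_pos, gaussian_code, nodes, k=1, code_weight=0.0)
# Joint matching prefers the node whose code agrees (node 0).
joint = nearest_nodes(gaussian_pos, gaussian_code, nodes, k=1, code_weight=1.0)
```

In this toy setup the position-only match attaches the Gaussian to a nearby node with a different code, while the joint match keeps it on the node with the matching code, which is the behavior the decoupled design is meant to encourage.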

Paper Structure (18 sections, 10 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: An overview of the Mango-GS framework. Our method is driven by a set of decoupled control nodes, each comprising a canonical position and a feature code. A dense 3D Gaussian cloud is associated with these nodes via a learned k-NN relationship based on both position and features. A temporal attention network takes the canonical node positions and a time window $[0, T]$ as input, processing them through MLP layers and temporal attention blocks to predict the nodes' deformations over the entire time window. This learned motion is then propagated back to the Gaussian cloud to produce the final dynamic scene representation for each frame. The entire model is optimized end-to-end with a temporal input masking scheme and a composite loss comprising a top-$k$ hard-frame photometric loss and a motion-aware temporal loss.
  • Figure 2: Position-only nodes versus decoupled nodes with position and code. We visualize 24 Gaussians (red) in a high-motion region and their three corresponding nodes (white). With the decoupled design, Gaussians attach to semantically consistent nodes rather than merely following spatial neighbors, an association that struggles under large motion.
  • Figure 3: The Architecture of the Temporal Deformation Network. For each of the $N$ control nodes, features over a window of $T$ frames are processed by an MLP backbone interleaved with temporal self-attention blocks. Attention operates along the time axis and is fused by a lightweight gate, then decoded to per-node translation, rotation, and scale for all $T$ frames.
  • Figure 4: Visual comparison between baselines and our method on the Neural 3D Video dataset. The main differences are highlighted with boxes and zoomed in.
  • Figure 5: Qualitative comparison on HyperNeRF. Our method produces sharp results. Differences are highlighted with boxes.
  • ...and 2 more figures
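
The Figure 1 caption mentions a top-$k$ hard-frame photometric loss without defining it here. One plausible reading, sketched below under that assumption, is to compute a photometric loss per frame in the window and average only the $k$ largest values, so that training focuses on the frames that are currently reconstructed worst. The function name, the plain mean over the selected frames, and the example numbers are hypothetical; the paper's actual formulation may differ.

```python
def topk_hard_frame_loss(per_frame_losses, k):
    """Average the k largest per-frame losses in a window.

    Assumed reading of "top-k hard-frame" loss: the hardest frames are
    those with the highest photometric error in the current window.
    """
    if not 1 <= k <= len(per_frame_losses):
        raise ValueError("k must be between 1 and the number of frames")
    hardest = sorted(per_frame_losses, reverse=True)[:k]
    return sum(hardest) / k


# Hypothetical photometric losses for a 4-frame window.
window_losses = [0.25, 0.5, 0.125, 1.0]

# Only the two worst frames (1.0 and 0.5) contribute.
loss = topk_hard_frame_loss(window_losses, k=2)
```

With `k` equal to the window length this reduces to the ordinary mean over all frames; smaller `k` concentrates the gradient signal on difficult frames.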