FLAG-4D: Flow-Guided Local-Global Dual-Deformation Model for 4D Reconstruction

Guan Yuan Tan, Ngoc Tuan Vu, Arghya Pal, Sailaja Rajanala, Raphael Phan C. -W., Mettu Srinivas, Chee-Ming Ting

TL;DR

FLAG-4D addresses 4D reconstruction by decoupling motion into local and global components through a dual-deformation pipeline. It couples an Instantaneous Deformation Network (IDN) for fine-grained local motion with a Global Motion Network (GMN) for long-range dynamics, guided by time embeddings and mutual learning, and grounded on a canonical Gaussian scene representation derived from SfM. A Temporal Fusion Encoder processes pretrained optical-flow embeddings, and Contextual Deformation Alignment (CDA) enables targeted cross-attention between local forecasts and global context, with a Deformation Refinement Network (DRN) integrating both cues. Empirically, FLAG-4D achieves state-of-the-art fidelity and temporal coherence on NeRF-DS and HyperNeRF benchmarks, preserving texture and structure across dynamic viewpoints. The approach reduces the classic local-global trade-off in 4D reconstruction and demonstrates robust performance under monocular input conditions, with potential for further flow-agnostic robustness.

Abstract

We introduce FLAG-4D, a novel framework for generating novel views of dynamic scenes by reconstructing how 3D Gaussian primitives evolve through space and time. Existing methods typically rely on a single Multilayer Perceptron (MLP) to model temporal deformations, and they often struggle to capture complex point motions and fine-grained dynamic details consistently over time, especially from sparse input views. Our approach, FLAG-4D, overcomes this by employing a dual-deformation network that dynamically warps a canonical set of 3D Gaussians over time into new positions and anisotropic shapes. This dual-deformation network consists of an Instantaneous Deformation Network (IDN) for modeling fine-grained, local deformations and a Global Motion Network (GMN) for capturing long-range dynamics, refined through mutual learning. To ensure these deformations are both accurate and temporally smooth, FLAG-4D incorporates dense motion features from a pretrained optical flow backbone. We fuse these motion cues from adjacent timeframes and use a deformation-guided attention mechanism to align this flow information with the current state of each evolving 3D Gaussian. Extensive experiments demonstrate that FLAG-4D achieves higher-fidelity and more temporally coherent reconstructions with finer detail preservation than state-of-the-art methods.
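The core idea of warping a canonical set of 3D Gaussians over time can be illustrated with a minimal sketch. This is not the FLAG-4D architecture: it uses a single small MLP with random weights, a sinusoidal time embedding, and per-axis log-scale offsets, all of which are assumptions made for illustration. Rotation updates, the IDN/GMN split, and optical-flow guidance are omitted, and every function name and layer size below is hypothetical.

```python
import numpy as np

def time_embedding(t, num_freqs=4):
    """Sinusoidal embedding of a scalar time t (assumed encoding, not from the paper)."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    return np.concatenate([np.sin(freqs * t), np.cos(freqs * t)])

def init_mlp(in_dim, hidden, out_dim, rng):
    """Random weights for a two-layer MLP (hypothetical sizes)."""
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.1, (hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }

def mlp(params, x):
    """Two-layer MLP with a ReLU hidden activation."""
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)
    return h @ params["w2"] + params["b2"]

def deform_gaussians(means, log_scales, t, params):
    """Warp canonical Gaussians to time t.

    means:      (N, 3) canonical centers
    log_scales: (N, 3) canonical per-axis log scales (anisotropic shape)
    Returns deformed centers (N, 3) and strictly positive scales (N, 3).
    """
    n = means.shape[0]
    emb = np.tile(time_embedding(t), (n, 1))      # same time code for every Gaussian
    out = mlp(params, np.concatenate([means, emb], axis=1))
    d_mean, d_log_scale = out[:, :3], out[:, 3:6]
    return means + d_mean, np.exp(log_scales + d_log_scale)

# Usage: five canonical Gaussians queried at t = 0.5.
rng = np.random.default_rng(0)
means = rng.normal(size=(5, 3))
log_scales = np.zeros((5, 3))
params = init_mlp(in_dim=3 + 8, hidden=16, out_dim=6, rng=rng)
new_means, new_scales = deform_gaussians(means, log_scales, 0.5, params)
```

Predicting offsets in log-scale space keeps the deformed scales positive without extra clamping, which is one common way to handle anisotropic shape updates.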

Paper Structure (27 sections, 12 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Visual comparison of our method against recent methods, namely 4DGS (Wu et al., 2024), SC-GS (Huang et al., 2024), and D-MiSo (Waczyńska et al., 2024), on the HyperNeRF dataset. Our method preserves finer detail across timesteps, particularly in the texture and edges of the zoomed-in regions, yielding higher-fidelity and more coherent reconstructions across dynamic viewpoints than the baselines.
  • Figure 2: FLAG-4D Methodology: Our dual-deformation framework for 4D reconstruction. Top: The overall pipeline: A monocular video sequence is used to generate an initial SfM point cloud, from which a canonical set of 3D Gaussians at $t=0$ is derived. The Dual-Deformation Network consists of an Instantaneous Deformation Network (IDN) and a Global Motion Network (GMN), which are trained synergistically through Mutual Learning. Bottom Left: The IDN processes the canonical Gaussians and a window of future-oriented time embeddings to produce a hypothesized local deformation. Bottom Right: The GMN integrates this local deformation hypothesis (as Query) with fused optical flow embeddings (as Key/Value) via a cross-attention mechanism, producing the final globally consistent deformation.
  • Figure 3: Comparison of Predicted Gaussian Deformation ($t \rightarrow t+10$) for the "Bell" Scene. (a) Frame 0 (GT). (b) Frame 10 (GT). (c) FLAG-4D (Ours) accumulated flow from $t=0$ to $t=10$, overlaid on Frame 0. (d) Deformable 3D Gaussians (Yang et al., 2024) accumulated flow. FLAG-4D produces a more accurate and coherent deformation field. The highlighted region (red box) demonstrates our method's superior preservation of the bell's local rigidity during its motion towards the state in Frame 10.
  • Figure 4: Qualitative comparisons between baseline methods and our approach on the NeRF-DS real-world dataset. Our method delivers superior rendering quality under complex scene dynamics: it captures finer details, preserves complex structure, and handles dynamic scene elements more effectively than baselines such as SC-GS (Huang et al., 2024), D-MiSo (Waczyńska et al., 2024), and 4DGS (Wu et al., 2024).
  • Figure 5: Rendered Depth Map Comparison. FLAG-4D produces depth maps with visibly greater detail and geometric accuracy (e.g., the printer's string and surface texture).
  • ...and 1 more figure
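The Figure 2 caption describes the GMN as using the local deformation hypothesis as Query and the fused optical-flow embeddings as Key/Value in a cross-attention step. A minimal single-head sketch of that pattern follows. The projection matrices, feature sizes, and token counts are assumptions chosen for illustration, and the function names are hypothetical; the paper's actual layers, head count, and fusion details are not specified in this section.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(local_feats, flow_feats, wq, wk, wv):
    """Single-head scaled dot-product cross-attention.

    local_feats: (N, Dl) per-Gaussian local deformation hypothesis (Query source)
    flow_feats:  (M, Df) fused optical-flow embedding tokens (Key/Value source)
    Returns the attended features (N, D) and the attention weights (N, M).
    """
    q = local_feats @ wq                      # (N, D)
    k = flow_feats @ wk                       # (M, D)
    v = flow_feats @ wv                       # (M, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N, M) similarity of each Gaussian to each token
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

# Usage: 4 Gaussians attend over 6 flow tokens (all sizes are illustrative).
rng = np.random.default_rng(0)
local_feats = rng.normal(size=(4, 10))
flow_feats = rng.normal(size=(6, 12))
wq = rng.normal(0.0, 0.1, (10, 8))
wk = rng.normal(0.0, 0.1, (12, 8))
wv = rng.normal(0.0, 0.1, (12, 8))
attended, weights = cross_attention(local_feats, flow_feats, wq, wk, wv)
```

Because the queries come from the per-Gaussian hypothesis, each Gaussian receives its own weighting over the flow tokens, which is what lets the global context be aligned with the current state of each primitive.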