FLAG-4D: Flow-Guided Local-Global Dual-Deformation Model for 4D Reconstruction
Guan Yuan Tan, Ngoc Tuan Vu, Arghya Pal, Sailaja Rajanala, Raphael Phan C.-W., Mettu Srinivas, Chee-Ming Ting
TL;DR
FLAG-4D approaches 4D reconstruction by decoupling motion into local and global components in a dual-deformation pipeline. An Instantaneous Deformation Network (IDN) models fine-grained local motion and a Global Motion Network (GMN) models long-range dynamics; both are guided by time embeddings, trained with mutual learning, and grounded on a canonical Gaussian scene representation derived from Structure-from-Motion (SfM). A Temporal Fusion Encoder processes embeddings from a pretrained optical-flow model, Contextual Deformation Alignment (CDA) applies targeted cross-attention between local forecasts and global context, and a Deformation Refinement Network (DRN) integrates both cues. On the NeRF-DS and HyperNeRF benchmarks, FLAG-4D achieves state-of-the-art fidelity and temporal coherence, preserving texture and structure across dynamic viewpoints. The approach eases the classic trade-off between local detail and global motion in 4D reconstruction and performs robustly with monocular input, with room to make it less dependent on the pretrained optical-flow model.
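To make the division of labour concrete, here is a minimal toy sketch of a dual-deformation warp. It is not the authors' implementation: the learned IDN, GMN, and DRN are replaced by hand-written stand-in functions, and every name, constant, and the sine-only time embedding is an assumption chosen for illustration.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Gaussian:
    position: tuple  # canonical centre (x, y, z)
    scale: tuple     # per-axis extent; unequal values give an anisotropic shape

def time_embedding(t, num_freqs=4):
    """Sine features of a normalised timestamp t at increasing frequencies."""
    return [math.sin((2 ** k) * math.pi * t) for k in range(num_freqs)]

def local_branch(emb):
    """Toy stand-in for the IDN: small offset driven by the highest frequency."""
    return (0.01 * emb[-1], 0.01 * emb[-1], 0.0)

def global_branch(emb):
    """Toy stand-in for the GMN: large offset driven by the lowest frequency."""
    return (0.5 * emb[0], 0.0, 0.0)

def refine(local, global_, gate=0.5):
    """Toy stand-in for the DRN: global motion plus a gated local correction."""
    return tuple(g + gate * l for l, g in zip(local, global_))

def warp(canonical, t):
    """Deform one canonical Gaussian to time t (hypothetical constants)."""
    emb = time_embedding(t)
    offset = refine(local_branch(emb), global_branch(emb))
    log_scale = 0.1 * emb[0]  # assumed small, time-dependent scale change
    position = tuple(p + d for p, d in zip(canonical.position, offset))
    scale = tuple(s * math.exp(log_scale) for s in canonical.scale)
    return Gaussian(position, scale)

canonical = Gaussian(position=(0.0, 1.0, 2.0), scale=(0.1, 0.2, 0.05))
for t in (0.0, 0.25, 0.5):
    print(t, warp(canonical, t))
```

In this toy, the warp at t = 0 returns the canonical Gaussian unchanged because every sine feature vanishes there. In the actual model the two branches are learned networks rather than fixed formulas.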
Abstract
We introduce FLAG-4D, a framework for synthesizing novel views of dynamic scenes by reconstructing how 3D Gaussian primitives evolve through space and time. Existing methods typically rely on a single Multilayer Perceptron (MLP) to model temporal deformations and often struggle to capture complex point motions and fine-grained dynamic details consistently over time, especially from sparse input views. FLAG-4D addresses this with a dual-deformation network that warps a canonical set of 3D Gaussians over time into new positions and anisotropic shapes. The network consists of an Instantaneous Deformation Network (IDN) that models fine-grained, local deformations and a Global Motion Network (GMN) that captures long-range dynamics; the two are refined through mutual learning. To keep these deformations accurate and temporally smooth, FLAG-4D incorporates dense motion features from a pretrained optical flow backbone. We fuse these motion cues from adjacent time frames and use a deformation-guided attention mechanism to align the flow information with the current state of each evolving 3D Gaussian. Extensive experiments demonstrate that FLAG-4D achieves higher-fidelity and more temporally coherent reconstructions, with finer detail preservation, than state-of-the-art methods.
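The alignment step can be pictured with generic scaled dot-product attention, where a feature describing one Gaussian's current deformation queries flow features from adjacent frames. The sketch below is only an illustration of that general idea: the abstract does not specify the attention formulation, so the function names, the two-dimensional feature vectors, and the reuse of keys as values are all assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([scale * dot(query, k) for k in keys])
    dim = len(values[0])
    fused = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return fused, weights

# Hypothetical flow features for frames t-1, t, t+1 (keys double as values here).
flow_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
# Hypothetical feature describing one Gaussian's current deformation state.
gaussian_query = [2.0, 0.0]

fused, weights = attend(gaussian_query, flow_feats, flow_feats)
print("weights:", weights)
print("fused flow feature:", fused)
```

Here the frame whose flow feature best matches the Gaussian's deformation feature receives the largest weight, so the fused vector leans toward that frame's motion cue.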
