4D Synchronized Fields: Motion-Language Gaussian Splatting for Temporal Scene Understanding

Mohamed Rayan Barhdadi, Samir Abdaljalil, Rasul Khanbayov, Erchin Serpedin, Hasan Kurban

Abstract

Current 4D representations decouple geometry, motion, and semantics: reconstruction methods discard interpretable motion structure; language-grounded methods attach semantics after motion is learned, blind to how objects move; and motion-aware methods encode dynamics as opaque per-point residuals without object-level organization. We propose 4D Synchronized Fields, a 4D Gaussian representation that learns object-factored motion in-loop during reconstruction and synchronizes language to the resulting kinematics through a per-object conditioned field. Each Gaussian trajectory is decomposed into shared object motion plus an implicit residual, and a kinematic-conditioned ridge map predicts temporal semantic variation. The result is a single representation in which reconstruction, motion, and semantics are structurally coupled, enabling open-vocabulary temporal queries that retrieve both objects and moments. On HyperNeRF, 4D Synchronized Fields achieves 28.52 dB mean PSNR, the highest among all language-grounded and motion-aware baselines and within 1.5 dB of reconstruction-only methods. On targeted temporal-state retrieval, the kinematic-conditioned field attains 0.884 mean accuracy, 0.815 mean vIoU, and 0.733 mean tIoU, surpassing 4D LangSplat (0.620, 0.433, and 0.439, respectively) and LangSplat (0.415, 0.304, and 0.262). Ablation confirms that kinematic conditioning is the primary driver, accounting for +0.45 tIoU over a static-embedding-only baseline. 4D Synchronized Fields is the only method that jointly exposes interpretable motion primitives and temporally grounded language fields from a single trained representation. Code will be released.

Paper Structure (64 sections, 22 equations, 3 figures, 6 tables)

Figures (3)

  • Figure 1: 4D Synchronized Fields learns a deformable 4D Gaussian scene whose per-Gaussian trajectories are decomposed in-loop into shared object motion plus implicit residuals. The resulting synchronized tracks and kinematics condition an object-time language field, trained from projected object crops and a per-object ridge map, enabling open-vocabulary temporal queries that retrieve both objects and moments, grounded in learned motion structure.
  • Figure 2: Method overview. A time-conditioned deformation MLP $\mathcal{D}_\theta$ predicts per-Gaussian deltas yielding deformed positions $\mathbf{x}_i(t)$. A shared object-motion model $\mathcal{M}_\phi$ produces per-object transforms yielding object-predicted positions $\tilde{\mathbf{x}}_i(t)$. The residual $\mathbf{r}_i(t)=\mathbf{x}_i(t)-\tilde{\mathbf{x}}_i(t)$ is defined implicitly and used only in regularizers; rendering uses $\mathbf{x}_i(t)$ unchanged. After training, synchronized tracks and 28D kinematic features are extracted and used to fit per-object ridge maps from kinematics to semantic residuals, yielding an object-time language field for open-vocabulary temporal queries; the resulting structured scene description is then leveraged by a multimodal LLM for downstream reasoning.
  • Figure 3: Targeted temporal-state retrieval on americano (left: "glass in luminous-liquid phase") and espresso (right: "glass cup with liquid above midpoint"). Rows: RGB input, 4D LangSplat activation, 4D Synchronized Fields activation, ground truth. 4D LangSplat activates broadly across the temporal extent; 4D Synchronized Fields produces tighter activations aligned with ground-truth intervals, reflecting kinematic conditioning's sensitivity to motion-correlated state transitions.
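
The two quantities named in the Figure 2 caption, the implicit residual $\mathbf{r}_i(t)=\mathbf{x}_i(t)-\tilde{\mathbf{x}}_i(t)$ and the per-object ridge map from kinematic features to semantic residuals, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the textbook closed-form ridge solution $W=(K^\top K+\lambda I)^{-1}K^\top S$, where the rows of $K$ are kinematic feature vectors (28D in the paper, 1D in the toy check below) and the rows of $S$ are semantic residuals. The function names, the regularization weight `lam`, and the omission of a bias term are all hypothetical choices made for illustration.

```python
from typing import List

Vec = List[float]
Mat = List[List[float]]


def implicit_residual(x: Vec, x_obj: Vec) -> Vec:
    """r_i(t) = x_i(t) - x~_i(t): deformed position minus object-predicted position."""
    return [a - b for a, b in zip(x, x_obj)]


def solve(A: Mat, B: Mat) -> Mat:
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | B]; rows are copied so the inputs stay untouched.
    M = [row_a[:] + row_b[:] for row_a, row_b in zip(A, B)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]


def fit_ridge(K: Mat, S: Mat, lam: float) -> Mat:
    """Closed-form ridge fit W = (K^T K + lam * I)^{-1} K^T S.

    K: one kinematic feature vector per time sample (hypothetical layout).
    S: one semantic residual vector per time sample (hypothetical layout).
    lam: assumed regularization weight; lam > 0 keeps the system well posed.
    """
    d, m = len(K[0]), len(S[0])
    KtK = [[sum(row[i] * row[j] for row in K) + (lam if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    KtS = [[sum(k[i] * s[j] for k, s in zip(K, S)) for j in range(m)]
           for i in range(d)]
    return solve(KtK, KtS)


def predict(W: Mat, k: Vec) -> Vec:
    """Predicted semantic residual for one kinematic feature vector k."""
    return [sum(k[i] * W[i][j] for i in range(len(k))) for j in range(len(W[0]))]


# Toy check with 1D features and 1D targets (made-up numbers, not paper data).
W = fit_ridge([[1.0], [2.0]], [[2.0], [4.0]], lam=5.0)
residual = implicit_residual([1.0, 2.0, 3.0], [0.5, 2.0, 1.0])
```

In this reading, one such map would be fitted per object after training, and a time-varying embedding would be obtained by adding the predicted residual to the object's static embedding. That last step is an assumption based on the abstract's comparison against a static-embedding-only baseline.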