SyncTrack4D: Cross-Video Motion Alignment and Video Synchronization for Multi-Video 4D Gaussian Splatting

Yonghan Lee, Tsung-Wei Huang, Shiv Gehlot, Jaehoon Choi, Guan-Ming Su, Dinesh Manocha

TL;DR

SyncTrack4D is a general framework for reconstructing dynamic scenes from unsynchronized multi-view videos, using dense 4D tracks as cues for both cross-video synchronization and 4D Gaussian Splatting (4DGS) reconstruction. It combines Fused Gromov–Wasserstein track matching, Dynamic Time Warping, and a motion-spline scaffold to estimate per-video temporal offsets and optimize a unified 4DGS representation. The approach achieves sub-frame synchronization (average temporal error below $0.26$ frames) and high-fidelity 4D reconstruction (a PSNR of $26.3$ on the Panoptic Studio dataset), without relying on predefined scene templates. This work extends 4D Gaussian Splatting to unsynchronized multi-view settings, enabling dynamic scene capture in unconstrained environments.

Abstract

Modeling dynamic 3D scenes is challenging due to their high-dimensional nature, which requires aggregating information from multiple views to reconstruct time-evolving 3D geometry and motion. We present a novel multi-video 4D Gaussian Splatting (4DGS) approach designed to handle real-world, unsynchronized video sets. Our approach, SyncTrack4D, directly leverages a dense 4D track representation of dynamic scene parts as cues for simultaneous cross-video synchronization and 4DGS reconstruction. We first compute dense per-video 4D feature tracks and cross-video track correspondences using a Fused Gromov-Wasserstein optimal transport approach. Next, we perform global frame-level temporal alignment to maximize the overlapping motion of matched 4D tracks. Finally, we achieve sub-frame synchronization through our multi-video 4D Gaussian Splatting built upon a motion-spline scaffold representation. The final output is a synchronized 4DGS representation with dense, explicit 3D trajectories and temporal offsets for each video. We evaluate our approach on the Panoptic Studio and SyncNeRF Blender datasets, demonstrating sub-frame synchronization accuracy with an average temporal error below 0.26 frames and high-fidelity 4D reconstruction reaching a PSNR of 26.3 on the Panoptic Studio dataset. To the best of our knowledge, our work is the first general 4D Gaussian Splatting approach for unsynchronized video sets that does not assume the existence of predefined scene objects or prior models.
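
For orientation, the track-matching step mentioned above can be read against the standard Fused Gromov-Wasserstein objective from the optimal transport literature. The notation is assumed here and is not taken from the paper: $M$ is a cross-video feature cost between tracks, $C^{(1)}$ and $C^{(2)}$ are intra-video structure (e.g. pairwise distance) matrices, $p$ and $q$ are track weights, and $\alpha \in [0, 1]$ trades feature similarity against structural consistency. The paper's exact costs and weighting may differ.

```latex
\min_{T \in \Pi(p, q)} \;
(1 - \alpha) \sum_{i,j} M_{ij}\, T_{ij}
\;+\;
\alpha \sum_{i,j,k,l} \left| C^{(1)}_{ik} - C^{(2)}_{jl} \right|^{2} T_{ij}\, T_{kl},
\qquad
\Pi(p, q) = \left\{ T \ge 0 \;:\; T \mathbf{1} = p,\; T^{\top} \mathbf{1} = q \right\}
```

With $\alpha = 0$ this reduces to feature-only optimal transport, and with $\alpha = 1$ to a pure Gromov-Wasserstein problem that compares structure only.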

Paper Structure

This paper contains 14 sections, 6 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: We present a general approach for 4D scene reconstruction from unsynchronized video sets. Our multi-stage approach jointly solves video synchronization and 4D Gaussian Splatting (4DGS) reconstruction by leveraging dense 4D pixel tracks as cues for motion matching and geometry recovery. From unsynchronized inputs, we estimate and align dense 4D tracks across videos, followed by refinement using high-fidelity photometric optimization within 4DGS. This results in a synchronized 4DGS representation with estimated per-video temporal offsets. Most of our 4D results are best viewed in the supplementary videos.
  • Figure 2: SyncTrack4D Pipeline. Given unsynchronized multi-video RGB inputs, we extract diverse 2D priors along with depths and camera poses from feed-forward multi-view models or sensors. (1) For each monocular video, we estimate 4D tracks and embed feature maps through 4DGS optimization. (2) We perform dense cross-video 4D track matching via a Fused Gromov–Wasserstein formulation that fuses feature similarity and geometric structure. (3) The resulting correspondences enable frame-level synchronization by minimizing inter-video motion discrepancies. (4) Finally, we aggregate all per-video 4D tracks with their initial offsets and jointly refine synchronization and geometry with a unified multi-video 4DGS. Our pipeline produces dense cross-video correspondences, a unified 4DGS model, and accurate per-video time offsets.
  • Figure 3: Feature-Only Optimal Transport matches and Fused Gromov–Wasserstein (FGW) matches. FGW produces geometrically more coherent correspondences by jointly modeling feature similarity and structural consistency.
  • Figure 4: Synchronization examples using Dynamic Time Warping (DTW). (a) Boxes, (b) Softball. DTW computes optimal monotonic correspondences (red line) between two temporal sequences. The optimal offset (green line) is selected as the mode of all estimated pairwise offsets. The softball scenes exhibit more distinctive cost maps due to their rich motion patterns.
  • Figure 5: Training samples from our multi-video 4DGS optimization. The initially unsynchronized per-video 4DGS set (left) converges to a synchronized representation (right) under photometric supervision.
  • ...and 4 more figures
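
The frame-level synchronization described in the Figure 4 caption (a DTW path between two temporal sequences, then the mode of the resulting offsets) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `dtw_path` and `frame_offset` are hypothetical, each video is reduced to a 1D per-frame motion signal, the frame-to-frame cost is an absolute difference, and the mode is taken over the per-step offsets along a single DTW path. The caption does not say whether the paper takes the mode over path steps, over matched track pairs, or both.

```python
from collections import Counter

import numpy as np


def dtw_path(cost):
    """Classic DTW on a precomputed (n, m) cost matrix.

    Returns the optimal monotonic path as a list of (i, j) index pairs,
    starting at (0, 0) and ending at (n - 1, m - 1).
    """
    n, m = cost.shape
    # acc[i, j] = minimal accumulated cost of aligning the first i and j frames.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev

    # Backtrack from the end, always moving to the cheapest predecessor.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        k = int(np.argmin((acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])))
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def frame_offset(signal_a, signal_b):
    """Integer offset d such that frame i of A best matches frame i + d of B.

    Assumption: the per-frame cost is |a_i - b_j|, and the offset is the
    mode of (j - i) over all steps of the DTW path.
    """
    a = np.asarray(signal_a, dtype=float)
    b = np.asarray(signal_b, dtype=float)
    cost = np.abs(a[:, None] - b[None, :])
    offsets = [j - i for i, j in dtw_path(cost)]
    return Counter(offsets).most_common(1)[0][0]
```

Calling `frame_offset(a, b)` on two motion signals returns a signed integer frame offset. Taking the mode rather than the mean makes the estimate less sensitive to the boundary steps that DTW is forced to take at the start and end of the sequences, where the path must reach the corners of the cost matrix.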