Dynamic Gaussian Scene Reconstruction from Unsynchronized Videos

Zhixin Xu, Hengyu Zhou, Yuan Liu, Wenhan Xue, Hao Pan, Wenping Wang, Bin Wang

TL;DR

This work tackles the problem of reconstructing dynamic scenes with 4D Gaussian Splatting (4DGS) from unsynchronized multi-view videos by introducing a coarse-to-fine temporal alignment module. The method jointly estimates per-camera time shifts, combining a coarse frame-level search using LoFTR and RANSAC with a learnable sub-frame refinement, and is designed as a plug-in for existing 4DGS frameworks. It demonstrates significant improvements over baseline methods on challenging DyNeRF-based data, including both neural deformation and direct 4D representations, while maintaining robustness to substantial time misalignment. The approach expands practical 4D dynamic capture by enabling high-quality reconstruction with more flexible, lower-cost camera setups, and provides strong ablations showing the complementary roles of coarse and fine temporal alignment.

Abstract

Multi-view video reconstruction plays a vital role in computer vision, enabling applications in film production, virtual reality, and motion analysis. While recent advances such as 4D Gaussian Splatting (4DGS) have demonstrated impressive capabilities in dynamic scene reconstruction, they typically assume that the input video streams are temporally synchronized. In real-world scenarios, this assumption often fails due to factors like camera trigger delays or independent recording setups, leading to temporal misalignment across views and reduced reconstruction quality. To address this challenge, we propose a novel temporal alignment strategy for high-quality 4DGS reconstruction from unsynchronized multi-view videos. Our method features a coarse-to-fine alignment module that estimates and compensates for each camera's time shift: it first determines a coarse, frame-level offset and then refines it to sub-frame accuracy. The module can be readily integrated into existing 4DGS frameworks, making them more robust to asynchronous data. Experiments show that our approach effectively processes temporally misaligned videos and significantly improves on baseline methods.
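The coarse, frame-level stage can be pictured as a brute-force search over integer offsets. The following is a minimal sketch under stated assumptions, not the authors' implementation: `match_score` is a hypothetical callback standing in for the RANSAC inlier count computed on LoFTR matches between two frames, and the search window `max_shift`, the `stride`, the averaging over overlapping frames, and the sign convention of the shift are all illustrative choices.

```python
from typing import Callable, Tuple


def coarse_offset(
    num_ref_frames: int,
    num_frames_j: int,
    match_score: Callable[[int, int], float],
    max_shift: int,
    stride: int = 1,
) -> Tuple[int, float]:
    """Return the integer shift (and its mean score) that best aligns video j
    to the reference video.

    match_score(t_ref, t_j) is a hypothetical stand-in for the number of
    geometrically consistent matches between reference frame t_ref and
    frame t_j of video j.
    """
    best_shift, best_score = 0, float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        scores = []
        for t_ref in range(0, num_ref_frames, stride):
            t_j = t_ref + shift
            if 0 <= t_j < num_frames_j:  # only score overlapping frames
                scores.append(match_score(t_ref, t_j))
        if not scores:
            continue
        mean_score = sum(scores) / len(scores)  # normalise by overlap length
        if mean_score > best_score:
            best_shift, best_score = shift, mean_score
    return best_shift, best_score


def toy_score(t_ref: int, t_j: int) -> float:
    # Toy data: the score peaks when video j lags the reference by 3 frames.
    return max(0, 100 - 10 * abs(t_j - t_ref - 3))


shift, score = coarse_offset(20, 20, toy_score, max_shift=5)
print(shift, score)
```

Averaging over the overlapping frames, rather than summing, keeps large shifts with little overlap from being penalised simply for having fewer frame pairs; whether the paper normalises this way is not stated here.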

Paper Structure

This paper contains 21 sections, 7 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: We introduce a novel readily integrable module that significantly enhances existing 4D dynamic scene reconstruction methods (e.g., 4DGaussians and SC-GS). Compared to the baseline result (left), our method (right) demonstrates superior quality in capturing intricate details and complex motion.
  • Figure 2: Overview of our two-stage temporal alignment pipeline. Left: Coarse Temporal Alignment estimates integer offsets $\Delta t_j^*$ by matching frames across videos using LoFTR and RANSAC. Right: Fine Temporal Alignment refines offsets with a learnable $\tau_j$. The result is supervised by a photometric reconstruction loss.
  • Figure 3: Illustration of our coarse temporal alignment. For each candidate temporal offset $\Delta t$ between two camera views, we evaluate the number of geometrically consistent feature matches, quantified by the number of RANSAC inliers. For a given view pair, we first extract putative matches using LoFTR (red dots), then apply RANSAC to robustly find inliers (green lines), whose count serves as our alignment score.
  • Figure 4: Corresponding frames used to compute the matching score under different values of $\Delta t_j$. For each video $j$, we find the offset $\Delta t_j$ relative to the reference video that maximizes the matching score.
  • Figure 5: Visual comparison of reconstruction results from unsynchronized inputs. We compare novel view synthesis results from the original 4DGaussians and RT4DGS methods against versions enhanced by our approach (+Ours). Images are from four scenes of the DyNeRF dataset: coffee_martini, cut_roasted_beef, cook_spinach, and flame_salmon.
  • ...and 1 more figure
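The fine stage described in the Figure 2 caption, a learnable $\tau_j$ supervised by a photometric reconstruction loss, can be illustrated with a one-dimensional toy problem. This is a sketch under strong assumptions rather than the paper's pipeline: the "scene" is a hypothetical sinusoid instead of a rendered 4DGS model, each frame is a single intensity value, the gradient is written out by hand where a real system would rely on automatic differentiation through the renderer, and the names `scene`, `refine_tau`, `FREQ`, the learning rate, and the step count are all made up for illustration.

```python
import math

FREQ = 0.3  # hypothetical angular frequency of the toy scene (radians per frame)


def scene(t: float) -> float:
    """Toy stand-in for rendering the dynamic scene at continuous time t."""
    return math.sin(FREQ * t)


def scene_dt(t: float) -> float:
    """Analytic time derivative of the toy scene."""
    return FREQ * math.cos(FREQ * t)


def refine_tau(observed, coarse_shift: int, lr: float = 5.0, steps: int = 200) -> float:
    """Gradient descent on a sub-frame offset tau, starting from the coarse shift.

    Minimises the mean squared difference between the scene evaluated at
    k + coarse_shift + tau and the observed intensity of frame k.
    """
    tau = 0.0
    n = len(observed)
    for _ in range(steps):
        grad = 0.0
        for k, pixel in enumerate(observed):
            t = k + coarse_shift + tau
            residual = scene(t) - pixel          # photometric error for frame k
            grad += 2.0 * residual * scene_dt(t)  # d(residual^2) / d(tau)
        tau -= lr * grad / n
    return tau


# Toy data: the camera is really offset by 3.37 frames; the coarse stage gave 3.
true_shift = 3.37
observed = [scene(k + true_shift) for k in range(30)]
tau = refine_tau(observed, coarse_shift=3)
print(round(tau, 3))
```

In this toy setting the residual after the coarse stage is less than one frame, so the loss is well behaved around the starting point and plain gradient descent recovers the fractional part of the offset. This is one way to see why a good integer initialisation helps the learnable refinement.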