Sync-NeRF: Generalizing Dynamic NeRFs to Unsynchronized Videos

Seoha Kim, Jeongmin Bae, Youngsik Yun, Hahyun Lee, Gun Bang, Youngjung Uh

TL;DR

Sync-NeRF addresses the challenge of unsynchronized multi-view videos in dynamic NeRFs by learning per-camera time offsets that jointly align observations with the scene's temporal dynamics. It supports both implicit temporal embeddings and grid-based representations, converting misalignment into optimizable parameters and thereby improving reconstruction quality without manual synchronization. The approach demonstrates strong gains on unsynchronized datasets and maintains advantages even when inputs are nearly synchronized, underscoring its practical impact for real-world, in-the-wild videography. By generalizing across baselines like MixVoxels and K-Planes, Sync-NeRF provides a versatile framework to robustly render dynamic scenes from imperfect multi-view data.

Abstract

Recent advancements in 4D scene reconstruction using neural radiance fields (NeRF) have demonstrated the ability to represent dynamic scenes from multi-view videos. However, they fail to reconstruct dynamic scenes and struggle to fit even the training views in unsynchronized settings. This happens because they employ a single latent embedding for each frame, while the multi-view images at the same frame were actually captured at different moments. To address this limitation, we introduce time offsets for individual unsynchronized videos and jointly optimize the offsets with NeRF. By design, our method is applicable to various baselines and improves them by large margins. Furthermore, finding the offsets naturally amounts to synchronizing the videos without manual effort. Experiments are conducted on the common Plenoptic Video Dataset and a newly built Unsynchronized Dynamic Blender Dataset to verify the performance of our method. Project page: https://seoha-kim.github.io/sync-nerf
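
The idea of treating each camera's misalignment as an optimizable parameter can be illustrated with a toy sketch. The example below is not the authors' implementation: it assumes a known 1D "scene" signal (a sine wave) in place of a NeRF, and the names `calibrated_time` and `fit_offsets` are hypothetical. In the paper the offsets are optimized jointly with the radiance field; here only the offsets are learned, to keep the example small.

```python
import numpy as np

def calibrated_time(t, delta):
    """Shift the shared frame time t by a per-camera offset delta."""
    return t + delta

def fit_offsets(signal, d_signal, t, observations, lr=0.1, steps=500):
    """Recover per-camera time offsets by gradient descent on squared error.

    signal, d_signal: the toy scene and its time derivative (assumed known here).
    t: (T,) shared frame times; observations: (K, T), one row per camera.
    """
    num_cameras = observations.shape[0]
    delta = np.zeros(num_cameras)  # start from "all cameras synchronized"
    for _ in range(steps):
        t_k = calibrated_time(t[None, :], delta[:, None])  # (K, T)
        residual = signal(t_k) - observations
        # d/d(delta_k) of mean((signal(t + delta_k) - obs_k)^2)
        grad = 2.0 * np.mean(residual * d_signal(t_k), axis=1)
        delta -= lr * grad
    return delta

# Three cameras, two of which started recording early or late (toy values).
t = np.linspace(0.0, 2.0 * np.pi, 50)
true_offsets = np.array([0.0, 0.3, -0.2])
observations = np.sin(t[None, :] + true_offsets[:, None])
estimated = fit_offsets(np.sin, np.cos, t, observations)
print(estimated)
```

Under these assumptions the estimated offsets approach the true ones, which is the sense in which finding the offsets synchronizes the videos.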

Paper Structure (40 sections, 6 equations, 13 figures, 15 tables)

Figures (13)

  • Figure 1: Overview. (a) The Plenoptic Video Dataset, commonly used in 4D scene reconstruction, contains an unsynchronized video. Image patches are all first frames. (b) If we include this view in the training set, baselines fail to reconstruct the motion around the unsynchronized viewpoint. (c) In the same setting, our method significantly outperforms the baselines.
  • Figure 2: Problem statement. (a) Ideally, all multi-view images at a frame capture the same moment of a scene. Each frame is represented by a latent embedding. (b) Some frames are not synchronized. Previous methods suffer from the discrepancy between the latent embedding of the frame and the actual state of the scene. (c) Our method allows assigning correct temporal latent embeddings to videos captured with temporal gaps by introducing learnable time offsets $\delta$ for individual cameras.
  • Figure 3: Learning curve of time offsets. We show camera offsets in the coffee_martini scene over the training iterations of Sync-MixVoxels. Our method successfully finds the offset of the outlier camera.
  • Figure 4: Synchronization with time offsets. (a) For given unsynchronized videos, (b) our method finds the time offsets $\delta$, which is equivalent to (c) automatically synchronizing the videos.
  • Figure 5: Continuous temporal embedding. (a) We present an implicit function-based approach for the methods utilizing per-frame temporal embeddings. We add time offset $\delta_k$ of camera $k$ to time input $t$. $\mathcal{T}_\theta$ is the implicit function for mapping calibrated time into temporal embedding $\mathbf{z}$. (b) We query the embedding at the calibrated time $t_k$ on grid-based models. Bilinear interpolation naturally allows continuous temporal embedding.
  • ...and 8 more figures
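
The grid-based case in Figure 5(b) can be sketched along the time axis alone. This is a simplified illustration, not the paper's code: it assumes a hypothetical 1D grid holding one feature vector per frame and interpolates linearly in time, whereas the caption describes bilinear interpolation. The function name `query_time_grid` and the choice to measure time in frame units are assumptions made for the example.

```python
import numpy as np

def query_time_grid(grid, t_k):
    """Look up a temporal embedding at a continuous calibrated time.

    grid: (T, C) array, one C-dimensional feature per integer frame index.
    t_k:  calibrated time t + delta_k in frame units; clamped to [0, T - 1].
    """
    num_frames = grid.shape[0]
    t_k = float(np.clip(t_k, 0.0, num_frames - 1.0))
    i0 = int(np.floor(t_k))           # frame just before t_k
    i1 = min(i0 + 1, num_frames - 1)  # frame just after t_k
    w = t_k - i0                      # fractional position between them
    return (1.0 - w) * grid[i0] + w * grid[i1]

# Toy grid with three frames and two feature channels.
grid = np.array([[0.0, 0.0], [2.0, 4.0], [4.0, 8.0]])
frame_time = 1.0
camera_offset = 0.25  # hypothetical per-camera offset delta_k
z = query_time_grid(grid, frame_time + camera_offset)
print(z)
```

Because the lookup is defined for any real-valued time, a fractional offset changes the embedding smoothly, which is what lets gradients reach the per-camera offsets.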