Visual Sync: Multi-Camera Synchronization via Cross-View Object Motion

Shaowei Liu, David Yifan Yao, Saurabh Gupta, Shenlong Wang

TL;DR

VisualSync aligns unsynchronized, unposed multi-camera videos captured in the wild by formulating global synchronization as the minimization of epipolar violations over dense cross-view tracklets. The method uses a three-stage pipeline: Stage 0 extracts camera poses, dense trajectories, and cross-view correspondences; Stage 1 performs pairwise discrete-time alignment by minimizing an epipolar-based energy; Stage 2 computes globally consistent offsets with robust IRLS. Across four diverse datasets, VisualSync outperforms baselines, achieving median synchronization errors on the order of tens of milliseconds in challenging dynamic scenes, which enables accurate multi-view reconstruction and downstream tasks such as novel-view synthesis. The approach is robust to viewpoint diversity, motion blur, and moving cameras, and offers a practical offline solution for multi-camera motion understanding in real-world scenarios.
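
To make the Stage 2 idea concrete, here is a minimal sketch of how pairwise offsets might be fused into one offset per camera with iteratively reweighted least squares (IRLS). It is an illustration under stated assumptions, not the authors' implementation: the function name `global_offsets_irls`, the convention that a pairwise measurement `d_ij` approximates `o[j] - o[i]`, the Cauchy weight, and the scale `c` are all hypothetical choices, since the summary only says "robust IRLS".

```python
import numpy as np

def global_offsets_irls(n_cams, pairs, n_iters=20, c=1.0):
    """Fuse pairwise offsets into per-camera offsets (units: frames).

    pairs: list of (i, j, d_ij), where d_ij is assumed to approximate o[j] - o[i].
    Camera 0 is fixed at offset 0 to remove the global time ambiguity.
    The Cauchy weight below is an assumed robust kernel, not one named by the source.
    """
    A = np.zeros((len(pairs), n_cams))
    b = np.zeros(len(pairs))
    for row, (i, j, d) in enumerate(pairs):
        A[row, i] = -1.0
        A[row, j] = 1.0
        b[row] = d
    A_red = A[:, 1:]                      # drop camera 0 (reference)
    w = np.ones(len(pairs))               # first pass is plain least squares
    for _ in range(n_iters):
        sw = np.sqrt(w)
        sol, *_ = np.linalg.lstsq(A_red * sw[:, None], b * sw, rcond=None)
        resid = A_red @ sol - b
        w = 1.0 / (1.0 + (resid / c) ** 2)  # down-weight large residuals
    return np.concatenate([[0.0], sol])
```

With redundant pairs, a single grossly wrong pairwise estimate receives a small weight after a few iterations, so the remaining consistent pairs dominate the solution.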

Abstract

Today, people can easily record memorable moments, from concerts, sports events, and lectures to family gatherings and birthday parties, with multiple consumer cameras. However, synchronizing these cross-camera streams remains challenging. Existing methods assume controlled settings, specific targets, manual correction, or costly hardware. We present VisualSync, an optimization framework based on multi-view dynamics that aligns unposed, unsynchronized videos at millisecond accuracy. Our key insight is that any moving 3D point, when co-visible in two cameras, obeys epipolar constraints once properly synchronized. To exploit this, VisualSync leverages off-the-shelf 3D reconstruction, feature matching, and dense tracking to extract tracklets, relative poses, and cross-view correspondences. It then jointly minimizes the epipolar error to estimate each camera's time offset. Experiments on four diverse, challenging datasets show that VisualSync outperforms baseline methods, achieving a median synchronization error below 50 ms.
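
The key insight can be illustrated with a small sketch of a pairwise offset search. It assumes a simplified setting that the abstract does not prescribe: two static cameras with a known fundamental matrix `F`, matched tracklets stored as arrays of shape `(frames, points, 2)`, integer frame shifts, and a mean point-to-epipolar-line distance as the energy. The names `epipolar_distance` and `pairwise_offset`, and the convention that frame `t` in camera 1 matches frame `t + d` in camera 2, are hypothetical. Moving cameras would need a per-frame-pair `F` instead of a single one.

```python
import numpy as np

def epipolar_distance(F, x1, x2):
    """Pixel distance from each point in x2 to the epipolar line F @ x1.

    x1, x2: (N, 2) arrays of matched image points in camera 1 and camera 2.
    """
    ones = np.ones((x1.shape[0], 1))
    h1 = np.hstack([x1, ones])
    h2 = np.hstack([x2, ones])
    lines = h1 @ F.T                          # row k is the line F @ h1[k] in image 2
    num = np.abs(np.sum(lines * h2, axis=1))  # |h2^T F h1|
    den = np.linalg.norm(lines[:, :2], axis=1) + 1e-12
    return num / den

def pairwise_offset(F, tracks1, tracks2, max_shift):
    """Brute-force search for the integer shift d minimizing the epipolar energy.

    Assumed convention: frame t of camera 1 corresponds to frame t + d of camera 2.
    Returns (best_shift, best_energy).
    """
    n1, n2 = len(tracks1), len(tracks2)
    best_shift, best_energy = 0, np.inf
    for d in range(-max_shift, max_shift + 1):
        lo, hi = max(0, -d), min(n1, n2 - d)  # frames of camera 1 with a partner
        if hi <= lo:
            continue
        t = np.arange(lo, hi)
        x1 = tracks1[t].reshape(-1, 2)
        x2 = tracks2[t + d].reshape(-1, 2)
        energy = float(np.mean(epipolar_distance(F, x1, x2)))
        if energy < best_energy:
            best_shift, best_energy = d, energy
    return best_shift, best_energy
```

At the correct shift, each tracked point lies on its epipolar line, so the energy drops toward zero; at wrong shifts the moving points drift off their lines. Points that happen to move within their epipolar plane provide no cue, which is one reason many tracklets are aggregated.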

Paper Structure

This paper contains 23 sections, 12 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: VisualSync Overview. Given multiple unsynchronized videos capturing the same dynamic scene from different viewpoints, VisualSync recovers globally time-aligned video streams by estimating temporal offsets between views. For example, in the volleyball scene, before synchronization the player's motion is misaligned across videos; afterwards, a given timestamp in all three streams corresponds to the same moment.
  • Figure 2: Epipolar-geometry cue for video sync: When cameras are time-aligned, keypoint tracks align with epipolar lines (bottom); misalignment causes deviations (middle). Minimizing these deviations across tracklets recovers the correct time offset.
  • Figure 3: Proposed framework: Given unsynchronized videos, VisualSync follows a three-stage pipeline. Stage 0 estimates camera parameters with VGGT [wang2025vggt], dense correspondences with CoTracker3 [karaev2024cotracker3], cross-view matches with MASt3R [leroy2024grounding], and dynamic objects with DEVA [cheng2023tracking]. In Stage 1, we estimate pairwise frame offsets by minimizing epipolar violations over matched trajectories. Stage 2 globally optimizes individual offsets to produce synchronized videos.
  • Figure 4: Qualitative Comparison of synchronization on Egohumans [khirodkar2023egohumans] across baselines. We visually assess temporal synchronization by presenting magnified views of the shuttlecock's position across time. In this complex scenario, marked by large temporal discrepancies, a small dynamic element, and moving cameras, VisualSync achieves the most accurate alignment.
  • Figure 5: Qualitative Comparison of Video Sync across datasets. We show the synchronized videos on the CMU-Panoptic, UDBD, 3D-POP, and Egohumans datasets. The top three rows show the estimated synchronized timestamps from three different views. The bottom row shows synchronized timelines across multiple videos. Our method performs robustly across diverse scenes.
  • ...and 9 more figures