RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems
Jaro Meyer, Frédéric Giraud, Joschua Wüthrich, Marc Pollefeys, Philipp Fürnstahl, Lilian Calvet
TL;DR
RocSync addresses millisecond-accurate temporal alignment across heterogeneous cameras without hardware synchronization by introducing a low-cost LED Clock that visually encodes timestamps. The clock combines a circular ring of LEDs that advances in 1 ms steps with a binary counter, so the start and end of each exposure can be read directly from a frame. Global timestamps follow $T_i^{j} = \nabla_j \, t_i^{j} + \beta_j$, a per-camera linear model whose slope $\nabla_j$ and offset $\beta_j$ are estimated with robust linear regression. Benchmarked against hardware synchronization, RocSync achieves an RMSE of about $1.34$ ms and delivers substantial improvements in downstream tasks such as multi-view pose estimation and 3D reconstruction via sub-frame interpolation; it is further validated in recordings with over 25 cameras spanning RGB and IR modalities. The approach is low-cost, camera-agnostic, and openly available, extending high-quality vision-based sensing to unconstrained industrial and clinical environments.
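To make the per-camera linear model concrete, the following is a minimal sketch of fitting a slope and offset that map a camera's local frame times to global LED Clock times. The text only says "robust linear regression", so the Theil-Sen estimator used here is an assumption chosen for illustration, not necessarily the authors' estimator. The function name `fit_camera_clock`, the millisecond units, and the synthetic data are likewise hypothetical.

```python
import numpy as np

def fit_camera_clock(local_t, global_T):
    """Fit global_T ~ slope * local_t + offset with a Theil-Sen estimator.

    Assumption: Theil-Sen stands in for the unspecified robust regression.
    """
    t = np.asarray(local_t, dtype=float)
    T = np.asarray(global_T, dtype=float)
    # All index pairs (i, j) with i < j.
    i, j = np.triu_indices(len(t), k=1)
    dt = t[j] - t[i]
    valid = dt != 0
    # The median of pairwise slopes is insensitive to a minority of outliers.
    slope = np.median((T[j] - T[i])[valid] / dt[valid])
    # The median residual gives a robust offset for that slope.
    offset = np.median(T - slope * t)
    return slope, offset

# Synthetic example (hypothetical values): slight clock drift, a 250 ms
# offset, small decoding noise, and two grossly misdecoded frames.
rng = np.random.default_rng(0)
t = np.arange(0.0, 2000.0, 100.0)               # local frame times in ms
T = 1.0001 * t + 250.0 + rng.normal(0.0, 0.1, t.size)
T[3] += 500.0
T[11] -= 300.0

slope, offset = fit_camera_clock(t, T)
print(f"slope={slope:.5f}, offset={offset:.2f} ms")
```

An ordinary least-squares fit on the same data would be pulled toward the two corrupted frames, which is why a robust estimator is the natural choice when some LED readings are misdecoded.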
Abstract
Accurate spatiotemporal alignment of multi-view video streams is essential for a wide range of dynamic-scene applications such as multi-view 3D reconstruction, pose estimation, and scene understanding. However, synchronizing multiple cameras remains a significant challenge, especially in heterogeneous setups combining professional and consumer-grade devices, visible and infrared sensors, or systems with and without audio, where common hardware synchronization capabilities are often unavailable. This limitation is particularly evident in real-world environments, where controlled capture conditions are not feasible. In this work, we present a low-cost, general-purpose synchronization method that achieves millisecond-level temporal alignment across diverse camera systems while supporting both visible (RGB) and infrared (IR) modalities. The proposed solution employs a custom-built LED Clock that encodes time through red and infrared LEDs, allowing visual decoding of the exposure window (start and end times) from recorded frames for millisecond-level synchronization. We benchmark our method against hardware synchronization and achieve a residual error of 1.34 ms RMSE across multiple recordings. In further experiments, our method outperforms light-, audio-, and timecode-based synchronization approaches and directly improves downstream computer vision tasks, including multi-view pose estimation and 3D reconstruction. Finally, we validate the system in large-scale surgical recordings involving over 25 heterogeneous cameras spanning both IR and RGB modalities. This solution simplifies and streamlines the synchronization pipeline and expands access to advanced vision-based sensing in unconstrained environments, including industrial and clinical applications.
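As a rough illustration of how an exposure window might be read from a frame, the sketch below converts a detected LED pattern into start and end times. Everything about the layout is an assumption made for the example: a ring of 100 LEDs (`RING_SIZE`), a counter that holds the number of completed ring revolutions at the start of the exposure, and LED detection already reduced to a boolean array. The function name `decode_exposure_window` is hypothetical, and the actual LED Clock design may differ.

```python
import numpy as np

RING_SIZE = 100  # hypothetical number of ring LEDs, not stated in the text
STEP_MS = 1.0    # the ring advances one LED per millisecond

def decode_exposure_window(ring_lit, counter_value):
    """Return (start_ms, end_ms) from the ring LEDs seen lit in one frame.

    Assumption: LEDs lit during the exposure form one contiguous arc, and
    counter_value is the number of ring revolutions at exposure start.
    """
    lit = np.asarray(ring_lit, dtype=bool)
    n = lit.size
    if not lit.any() or lit.all():
        raise ValueError("exposure window is ambiguous")
    # The arc starts at a lit LED whose cyclic predecessor is unlit.
    starts = np.flatnonzero(lit & ~np.roll(lit, 1))
    if starts.size != 1:
        raise ValueError("lit LEDs do not form a single arc")
    start_idx = int(starts[0])
    length = int(lit.sum())
    start_ms = (counter_value * n + start_idx) * STEP_MS
    end_ms = start_ms + length * STEP_MS
    return start_ms, end_ms

# Example: LEDs 97, 98, 99, 0, 1 are lit (the arc wraps past index 0)
# while the counter reads 12.
ring = np.zeros(RING_SIZE, dtype=bool)
ring[[97, 98, 99, 0, 1]] = True
window = decode_exposure_window(ring, 12)
print(window)
```

Finding the arc start by its unlit predecessor, instead of taking the smallest lit index, keeps the decoding correct when the exposure spans the wrap-around point of the ring.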
