RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems

Jaro Meyer, Frédéric Giraud, Joschua Wüthrich, Marc Pollefeys, Philipp Fürnstahl, Lilian Calvet

TL;DR

RocSync tackles millisecond-accurate temporal alignment across heterogeneous cameras without hardware synchronization by introducing a low-cost LED Clock that visually encodes timestamps. The clock uses a circular ring of LEDs advancing in 1 ms steps plus a binary counter, so the start and end of each frame's exposure can be read directly from the image; global timestamps follow the linear model $T_i^{j} = \nabla_j \, t_i^{j} + \beta_j$, whose two parameters are estimated per camera with robust linear regression. Benchmarked against hardware synchronization, RocSync achieves an RMSE of about $1.34$ ms, and it is validated in recordings with over 25 cameras across RGB and IR modalities. It also delivers substantial improvements in downstream tasks such as multi-view pose estimation and 3D reconstruction via sub-frame interpolation. The approach is low-cost, camera-agnostic, and openly available, extending high-quality vision-based sensing to unconstrained industrial and clinical environments.
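As a rough illustration of the per-camera fit, the sketch below maps a camera's local frame timestamps to LED-clock time with a slope and an offset. The summary only states that a robust linear regression is used; the Theil-Sen estimator chosen here, and the names `fit_clock_model` and `to_global`, are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def fit_clock_model(local_ts, global_ts):
    """Theil-Sen fit of global_ts ~= slope * local_ts + offset.

    Assumption: Theil-Sen stands in for the unspecified robust
    regression; it tolerates a minority of mis-decoded frames.
    """
    t = np.asarray(local_ts, dtype=float)
    g = np.asarray(global_ts, dtype=float)
    # All index pairs (i, j) with i < j.
    i, j = np.triu_indices(len(t), k=1)
    dt = t[j] - t[i]
    valid = dt != 0  # skip pairs with identical local timestamps
    # Slope is the median of all pairwise slopes.
    slope = np.median((g[j] - g[i])[valid] / dt[valid])
    # Offset is the median residual under that slope.
    offset = np.median(g - slope * t)
    return slope, offset

def to_global(local_ts, slope, offset):
    """Apply the fitted linear model to local timestamps."""
    return slope * np.asarray(local_ts, dtype=float) + offset

# Synthetic example (made-up numbers): slight clock drift plus one bad decode.
local = np.arange(0, 1000, 100)
decoded = 1.0001 * local + 250.0
decoded[4] += 500.0  # simulated outlier
slope, offset = fit_clock_model(local, decoded)
```

Because both parameters are medians, a single corrupted decode in this example leaves the recovered slope and offset essentially unchanged, which is the property a robust fit is meant to provide.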

Abstract

Accurate spatiotemporal alignment of multi-view video streams is essential for a wide range of dynamic-scene applications such as multi-view 3D reconstruction, pose estimation, and scene understanding. However, synchronizing multiple cameras remains a significant challenge, especially in heterogeneous setups combining professional and consumer-grade devices, visible and infrared sensors, or systems with and without audio, where common hardware synchronization capabilities are often unavailable. This limitation is particularly evident in real-world environments, where controlled capture conditions are not feasible. In this work, we present a low-cost, general-purpose synchronization method that achieves millisecond-level temporal alignment across diverse camera systems while supporting both visible (RGB) and infrared (IR) modalities. The proposed solution employs a custom-built LED Clock that encodes time through red and infrared LEDs, allowing visual decoding of the exposure window (start and end times) from recorded frames for millisecond-level synchronization. We benchmark our method against hardware synchronization and achieve a residual error of 1.34 ms RMSE across multiple recordings. In further experiments, our method outperforms light-, audio-, and timecode-based synchronization approaches and directly improves downstream computer vision tasks, including multi-view pose estimation and 3D reconstruction. Finally, we validate the system in large-scale surgical recordings involving over 25 heterogeneous cameras spanning both IR and RGB modalities. This solution simplifies and streamlines the synchronization pipeline and expands access to advanced vision-based sensing in unconstrained environments, including industrial and clinical applications.

Paper Structure

This paper contains 46 sections, 6 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Our custom LED Clock. (a) Example of the device in use. (b) Rendering of the final PCB design with annotated features (red features encode the timestamp and blue features are used for detection & Euclidean rectification). The exposure window of the camera is visible in (a) as a red elliptical arc, which can be decoded with an accuracy of about 1 ms to enable sub-frame temporal synchronization.
  • Figure 2: Our computer vision pipeline: (a) raw image with ArUco detection, (b) coarse homography reprojection using the ArUco marker with detected blobs shown in red circles, and (c) decoded LEDs after refined homography reprojection using the corner blobs from the previous step. All LED positions are marked with blue circles, while illuminated LEDs are highlighted in red. In this example, the binary counter shows 12 and the illuminated ring segment indicates that the exposure was taken from LED 40 to LED 57. Combining this information, we know that the exact exposure of this frame occurred from $12 \times 100 + 40 = 1240\,\text{ms}$ to $12 \times 100 + 57 = 1257\,\text{ms}$.
  • Figure 3: Mean Euclidean distance between two independent 3D hand pose reconstructions obtained from different camera subsets (GoPros 1&2 and 3&4) at $T_5$, (a) with and (b) without sub-frame synchronization. Sub-frame synchronization substantially reduces the discrepancy between reconstructions, highlighting its importance for accurate 3D hand pose estimation.
  • Figure 4: Qualitative comparison of 3D hand reconstructions obtained (a) with nearest-frame synchronization and (b) with sub-frame synchronization (triangulation from GoPro 3 and 4), for the left hand in the hand scene. The nearest-frame approach produces noticeable discrepancies in hand shape, whereas sub-frame synchronization yields a reconstruction consistent with the true hand configuration.
  • Figure 5: Comparison of 3D reconstruction results without and with sub-frame synchronization. Top: reference real images for the reconstructions shown in the middle panels (a) and the bottom panels (b). Middle: with nearest-frame synchronization, the left hand shows large reconstruction artifacts (c), whereas the proposed method, which allows sub-frame synchronization, reconstructs it successfully (d). Bottom: nearest-frame synchronization fails to reconstruct the hammer (e), which is mostly recovered using sub-frame interpolation (f) (see image center). Panels (c)–(d) use footage from GoPro 1, GoPro 3, and GoPro 4; panels (e)–(f) use GoPro 2 and GoPro 3.
  • ...and 2 more figures
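The decoding arithmetic in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the ring size of 100 LEDs is inferred from the caption's `12 × 100` term, the function names are hypothetical, and the wrap-around rule (an arc that crosses LED 0 ends in the next ring cycle, with the counter value taken to belong to the exposure start) is an assumption.

```python
import numpy as np

RING_SIZE = 100   # assumption: inferred from the "12 x 100" term in the caption
STEP_MS = 1       # the ring advances in 1 ms steps

def lit_arc(lit):
    """Return (start_led, end_led) of the single contiguous lit arc on the ring."""
    lit = np.asarray(lit, dtype=bool)
    prev = np.roll(lit, 1)    # prev[i] == lit[i - 1], circularly
    nxt = np.roll(lit, -1)    # nxt[i] == lit[i + 1], circularly
    starts = np.flatnonzero(lit & ~prev)
    ends = np.flatnonzero(lit & ~nxt)
    if len(starts) != 1:
        raise ValueError("expected exactly one contiguous lit arc")
    return int(starts[0]), int(ends[0])

def exposure_window_ms(counter, start_led, end_led):
    """Combine the binary counter with ring positions into (start_ms, end_ms)."""
    base = counter * RING_SIZE * STEP_MS
    start_ms = base + start_led * STEP_MS
    end_ms = base + end_led * STEP_MS
    if end_led < start_led:
        # Assumption: the arc crossed LED 0, so the end lies in the next cycle.
        end_ms += RING_SIZE * STEP_MS
    return start_ms, end_ms

# Example from the Figure 2 caption: counter 12, LEDs 40 through 57 lit.
ring = np.zeros(RING_SIZE, dtype=bool)
ring[40:58] = True
start_led, end_led = lit_arc(ring)
window = exposure_window_ms(12, start_led, end_led)  # (1240, 1257)
```

The difference between the two values, 17 ms here, is the decoded exposure duration for that frame.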