Joint Spatial-Temporal Calibration for Camera and Global Pose Sensor

Junlin Song, Antoine Richard, Miguel Olivares-Mendez

TL;DR

This work tackles the problem of calibrating the spatial-temporal relationship between a monocular camera and a global pose sensor. It introduces two complementary methods. The first is an offline target-based calibration that uses a grid of AprilTags to jointly optimize the camera intrinsics, the camera-to-marker transform ${}_M^C T$, the global pose transform ${}_W^G T$, and the trajectory. The second is an online target-less EKF-based method that estimates time-varying spatial-temporal parameters without requiring a target. The authors provide a detailed observability analysis showing full observability under fully excited 6DoF motion and characterize the degenerate motions. Simulation studies and real-world hand-held experiments (including sequences with and without targets) validate accuracy and consistency, as well as the ability to track time-varying parameters. The methods enable accurate evaluation of localization algorithms and of broader computer-vision tasks that rely on precise spatial-temporal alignment, even when traditional hand-eye calibration is infeasible. The work also derives analytical on-manifold Jacobians for the target-based method, integrates robust data fusion, and demonstrates practical applicability on public datasets such as TUM-VI, with implications for visual SLAM, annotation, and multi-sensor fusion.
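To make the target-based objective concrete, here is a minimal sketch of a reprojection residual for a single AprilTag corner. It assumes a pinhole camera without distortion, the convention that ${}_A^B T$ maps points from frame A to frame B, that W is the target frame, and that the marker pose ${}_M^G T$ reported by the global pose sensor has already been aligned to the camera timestamp. The helper names (`make_T`, `inv_T`, `project`, `reprojection_residual`) are hypothetical; this is not the authors' implementation, which additionally estimates the time offset and uses analytical on-manifold Jacobians.

```python
import numpy as np

def make_T(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation R and a 3-vector t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def inv_T(T):
    """Invert a rigid transform using R^T instead of a general matrix inverse."""
    R, t = T[:3, :3], T[:3, 3]
    return make_T(R.T, -R.T @ t)

def project(K, p_C):
    """Pinhole projection (no distortion) of a 3D point expressed in the camera frame."""
    uvw = K @ p_C
    return uvw[:2] / uvw[2]

def reprojection_residual(K, T_C_M, T_G_M, T_G_W, p_W, uv_obs):
    """Predicted minus observed pixel position of one target corner p_W.

    Assumed chain: W -> G via T_G_W, G -> M via inv(T_G_M), M -> C via T_C_M.
    T_G_M is the global pose measurement already aligned to the camera timestamp.
    """
    T_C_W = T_C_M @ inv_T(T_G_M) @ T_G_W
    p_C = T_C_W[:3, :3] @ p_W + T_C_W[:3, 3]
    return project(K, p_C) - uv_obs

# Usage with made-up numbers: camera 5 cm from the marker, target 2 m in front.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
T_C_M = make_T(np.eye(3), np.array([0.05, 0.0, 0.0]))
T_G_M = make_T(np.eye(3), np.array([0.0, 0.0, -2.0]))
T_G_W = np.eye(4)
p_W = np.array([0.1, -0.2, 0.0])
uv_obs = np.array([358.5, 190.0])  # observation shifted by 1 px in u
r = reprojection_residual(K, T_C_M, T_G_M, T_G_W, p_W, uv_obs)
print(r)  # close to (-1, 0)
```

In a full calibration, residuals like this would be stacked over all detected corners and images and minimized over the intrinsics, ${}_M^C T$, ${}_W^G T$, and the trajectory with a nonlinear least-squares solver.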

Abstract

In robotics, motion capture systems have been widely used to measure the accuracy of localization algorithms. Moreover, this infrastructure can also be used for other computer vision tasks, such as the evaluation of Visual(-Inertial) SLAM dynamic initialization, multi-object tracking, or automatic annotation. Yet, to work optimally, these functionalities require accurate and reliable spatial-temporal calibration parameters between the camera and the global pose sensor. In this study, we provide two novel solutions to estimate these calibration parameters. First, we design an offline target-based method with high accuracy and consistency, in which the spatial-temporal parameters, camera intrinsics, and trajectory are optimized simultaneously. Then, we propose an online target-less method that eliminates the need for a calibration target and enables the estimation of time-varying spatial-temporal parameters. Additionally, we perform a detailed observability analysis for the target-less method. Our theoretical findings regarding observability are validated by simulation experiments and provide explainable guidelines for calibration. Finally, the accuracy and consistency of the two proposed methods are evaluated on hand-held real-world datasets where traditional hand-eye calibration methods do not work.
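Temporal calibration amounts to evaluating the global pose stream at a shifted timestamp. The sketch below assumes the convention $t_G = t_C + t_d$ for a time offset $t_d$ (the sign convention is an assumption), interpolates positions linearly and orientations by quaternion SLERP, and stores quaternions as $(w, x, y, z)$. `slerp` and `pose_at` are hypothetical helpers for illustration, not part of the paper.

```python
import numpy as np

def slerp(q0, q1, s):
    """Spherical linear interpolation between unit quaternions (w, x, y, z), s in [0, 1]."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:  # q and -q are the same rotation; take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:  # nearly identical: normalized linear interpolation is stable
        q = q0 + s * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1.0 - s) * theta) * q0 + np.sin(s * theta) * q1) / np.sin(theta)

def pose_at(t_cam, t_d, stamps, positions, quats):
    """Global pose (position, quaternion) at camera time t_cam shifted by offset t_d.

    stamps: sorted 1D array of global pose timestamps; positions: (N, 3); quats: (N, 4).
    """
    t = t_cam + t_d
    if not stamps[0] <= t <= stamps[-1]:
        raise ValueError("shifted timestamp lies outside the global pose stream")
    # Index of the interval [stamps[i], stamps[i + 1]] that contains t.
    i = int(np.clip(np.searchsorted(stamps, t) - 1, 0, len(stamps) - 2))
    s = (t - stamps[i]) / (stamps[i + 1] - stamps[i])
    p = (1.0 - s) * positions[i] + s * positions[i + 1]
    return p, slerp(quats[i], quats[i + 1], s)

# Usage: 10 Hz global poses moving along x while yawing about z.
stamps = np.array([0.0, 0.1, 0.2])
positions = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
quats = np.array([[1.0, 0.0, 0.0, 0.0],
                  [np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)],
                  [0.0, 0.0, 0.0, 1.0]])
p, q = pose_at(0.04, 0.01, stamps, positions, quats)  # halfway between the first two poses
print(p, q)
```

An online filter would treat $t_d$ as a state and differentiate this lookup with respect to it; the sketch only shows the lookup itself.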

Paper Structure (19 sections, 2 theorems, 36 equations, 9 figures, 2 tables)

Key Result

Lemma 5.1

If the frame $\{M\}$ undergoes pure translation (no rotation), ${}^C p_M$ is unobservable. The corresponding right null space of the observability matrix ${O}$ is:

Figures (9)

  • Figure 1: (a) Photo of the sensor setup, taken from schubert2018tum. (b) The spatial-temporal relationship between the camera measurements and the global pose measurements.
  • Figure 2: (a) Coordinate frames for the target-based method. (b) Coordinate frames for the target-less method.
  • Figure 3: Expected feature positions (green) and predicted feature positions (red) in the image.
  • Figure 4: Errors (solid lines) and $1\sigma$ bounds (dashed lines) of the spatial-temporal calibration parameters. The $x$-axis shows time in seconds. Left to right corresponds to Case 1 to Case 5 in Section "Validation of the Observability Analysis". The estimation errors of the rotation and temporal calibration parameters converge to zero in all cases, whereas the convergence of the translation calibration parameter varies from case to case.
  • Figure 5: $imu1$ is used. GT: ground-truth trajectory output by the motion capture system. PnP: camera trajectory output by the PnP algorithm. Ours: refined camera trajectory ${}_{C_i}^{W}T,\ i = 1, \dots, N$.
  • ...and 4 more figures
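Figure 4 judges consistency by whether the errors stay inside the $\sigma$ bounds. A generic way to quantify this, sketched below, is the fraction of samples inside $\pm k\sigma$ and the normalized estimation error squared (NEES). These are standard filter-consistency checks; it is an assumption that they match the exact metrics used in the paper, and the function names are hypothetical.

```python
import numpy as np

def within_bounds_ratio(errors, sigmas, k=3.0):
    """Fraction of samples whose absolute error is at most k times the reported sigma."""
    errors = np.asarray(errors, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    return float(np.mean(np.abs(errors) <= k * sigmas))

def nees(error, cov):
    """Normalized estimation error squared e^T P^{-1} e for a single state sample."""
    e = np.asarray(error, dtype=float)
    return float(e @ np.linalg.solve(np.asarray(cov, dtype=float), e))

print(within_bounds_ratio([0.1, -0.5, 2.0], [0.2, 0.2, 0.2]))  # 2 of 3 inside +/-3 sigma
print(nees([1.0, 2.0], np.diag([1.0, 4.0])))                   # 1 + 4/4 = 2.0
```

For a consistent estimator of an $n$-dimensional state, the average NEES over many runs should be close to $n$, and under a Gaussian assumption roughly 99.7% of scalar errors should fall within $\pm 3\sigma$.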

Theorems & Definitions (4)

  • Lemma 5.1
  • proof
  • Lemma 5.2
  • proof
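An intuition for Lemma 5.1 comes from the classic hand-eye relation $AX = XB$, used here as an analogy rather than the paper's EKF observability matrix ${O}$. Its translation part reads $(R_{A_i} - I)\,t_X = R_X t_{B_i} - t_{A_i}$, so the lever arm $t_X$ enters only through $R_{A_i} - I$. Under pure translation every $R_{A_i} = I$, the stacked coefficient matrix is zero, and $t_X$ cannot be recovered. The sketch below checks this rank numerically; `rot` and `lever_arm_rank` are hypothetical helpers.

```python
import numpy as np

def rot(axis, angle):
    """Rotation matrix from an axis-angle pair via Rodrigues' formula."""
    k = np.asarray(axis, dtype=float)
    k = k / np.linalg.norm(k)
    K = np.array([[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * K @ K

def lever_arm_rank(relative_rotations, tol=1e-9):
    """Rank of the stacked (R_A - I) blocks that multiply the unknown translation t_X."""
    M = np.vstack([R - np.eye(3) for R in relative_rotations])
    return int(np.linalg.matrix_rank(M, tol=tol))

pure_translation = [np.eye(3)] * 5
one_axis = [rot([0, 0, 1], a) for a in (0.2, 0.5, 0.9)]
two_axes = [rot([0, 0, 1], 0.4), rot([1, 0, 0], 0.7)]

print(lever_arm_rank(pure_translation))  # 0: t_X entirely unobservable
print(lever_arm_rank(one_axis))          # 2: component along the single rotation axis is lost
print(lever_arm_rank(two_axes))          # 3: two non-parallel axes determine t_X
```

The single-axis case illustrates the general hand-eye fact that rotations about at least two non-parallel axes are needed; it is not a restatement of the paper's own degenerate-motion results.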