Stable Offline Hand-Eye Calibration for any Robot with Just One Mark

Sicheng Xie, Lingchen Meng, Zhiying Du, Shuyuan Tu, Haidong Cao, Jiaqi Leng, Zuxuan Wu, Yu-Gang Jiang

TL;DR

CalibAll tackles the lack of camera extrinsics in off-the-shelf robotic datasets by proposing a training-free, offline hand-eye calibration method that uses a single annotated mark. It combines cross-robot mark localization based on vision foundation models, temporal PnP for coarse initialization, and differentiable rendering to refine the extrinsic $T_b^c$, achieving robust performance across three robot platforms. The approach yields high-precision extrinsics (e.g., on DREAM-real, AUC $=97.64$, ADD $=0.008$ m) and provides useful auxiliary annotations such as depth maps, per-link masks, and 2D EEF trajectories to support downstream tasks. This enables improved imitation-learning pipelines with camera-space actions and broader pretraining opportunities without robot-specific calibration data.
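To make the reported ADD value concrete, the following is a minimal sketch of how an ADD-style error is commonly computed for camera-to-robot pose estimation: the mean 3D distance between robot keypoints mapped into the camera frame by the estimated and by the ground-truth extrinsic. The function name `add_metric`, its argument names, and the choice of keypoints are assumptions for illustration, not the authors' evaluation code.

```python
import numpy as np

def add_metric(T_est, T_gt, keypoints_base):
    """Mean 3D distance (in metres) between keypoints mapped by two extrinsics.

    T_est, T_gt:     (4, 4) homogeneous base-to-camera transforms.
    keypoints_base:  (N, 3) robot keypoints expressed in the base frame.
    All names here are hypothetical, chosen for this sketch only.
    """
    kp = np.asarray(keypoints_base, dtype=float)
    kp_h = np.hstack([kp, np.ones((len(kp), 1))])   # homogeneous coordinates
    p_est = (T_est @ kp_h.T).T[:, :3]               # keypoints under the estimate
    p_gt = (T_gt @ kp_h.T).T[:, :3]                 # keypoints under ground truth
    return float(np.linalg.norm(p_est - p_gt, axis=1).mean())
```

Under this definition, an estimate that differs from the ground truth only by an 8 mm translation gives an ADD of 0.008 m regardless of which keypoints are used.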

Abstract

Imitation learning has achieved remarkable success in a variety of robotic tasks by learning a mapping function from camera-space observations to robot-space actions. Recent work indicates that the use of robot-to-camera transformation information (i.e., camera extrinsics) benefits the learning process and produces better results. However, camera extrinsics are often unavailable, and estimation methods usually suffer from local minima and poor generalization. In this paper, we present CalibAll, a simple yet effective method that requires only a single mark and performs training-free, stable, and accurate camera extrinsic estimation across diverse robots and datasets through a coarse-to-fine calibration pipeline. In particular, we annotate a single mark on an end-effector (EEF) and leverage the correspondence ability that emerges in vision foundation models (VFMs) to automatically localize the corresponding mark across robots in diverse datasets. Using this mark, together with point tracking and the 3D EEF trajectory, we obtain a coarse camera extrinsic via temporal Perspective-n-Point (PnP). This estimate is further refined through a rendering-based optimization that aligns rendered and ground-truth masks, yielding an accurate and stable camera extrinsic. Experimental results demonstrate that our method outperforms state-of-the-art approaches, showing strong robustness and general effectiveness across three robot platforms. It also produces useful auxiliary annotations such as depth maps, link-wise masks, and end-effector 2D trajectories, which can further support downstream tasks.
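The coarse step can be illustrated with a small sketch. The idea is that a single mark yields only one 2D-3D correspondence per frame, but stacking the tracked 2D mark locations and the matching 3D EEF positions over many frames gives enough correspondences for PnP. The code below is a plain NumPy Direct Linear Transform (DLT) solver standing in for whatever PnP solver the authors use; the function name `temporal_pnp_dlt` and its arguments are hypothetical, and it assumes known intrinsics `K`, no lens distortion, and at least six non-coplanar trajectory points.

```python
import numpy as np

def temporal_pnp_dlt(points_base, pixels, K):
    """Estimate a base-to-camera transform from per-frame 3D/2D matches.

    points_base: (N, 3) mark positions in the robot base frame, one per frame.
    pixels:      (N, 2) tracked mark locations in the image, one per frame.
    K:           (3, 3) camera intrinsics (assumed known, no distortion).
    Returns a (4, 4) homogeneous transform. DLT is an assumption of this
    sketch, not necessarily the solver used in the paper.
    """
    points_base = np.asarray(points_base, dtype=float)
    pixels = np.asarray(pixels, dtype=float)
    n = len(points_base)

    # Normalized image coordinates: K^-1 [u, v, 1]^T.
    uv1 = np.hstack([pixels, np.ones((n, 1))])
    xy = (np.linalg.inv(K) @ uv1.T).T[:, :2]

    # Two linear constraints per frame on the 12 entries of P = [R | t].
    Xh = np.hstack([points_base, np.ones((n, 1))])
    zeros = np.zeros_like(Xh)
    rows_x = np.hstack([Xh, zeros, -xy[:, :1] * Xh])
    rows_y = np.hstack([zeros, Xh, -xy[:, 1:] * Xh])
    A = np.vstack([rows_x, rows_y])

    # Null vector of A gives P up to an unknown scale and sign.
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)

    # Project the 3x3 block onto a rotation and undo the scale.
    U, S, Vt_r = np.linalg.svd(P[:, :3])
    R = U @ Vt_r
    t = P[:, 3] / S.mean()
    if np.linalg.det(R) < 0:   # resolve the sign ambiguity
        R, t = -R, -t

    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T
```

With noisy tracks, a linear estimate like this is only a starting point, which is consistent with the abstract's description of a coarse extrinsic that is subsequently refined by rendering-based optimization.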

Paper Structure

This paper contains 22 sections, 7 equations, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: Overview of CalibAll, which estimates camera extrinsics automatically and without training for data from any robot type, and provides additional annotations from a single mark.
  • Figure 2: Architecture overview of CalibAll, a simple yet effective method that requires only a single mark. It follows a coarse-to-fine calibration pipeline that achieves training-free, stable, and accurate camera extrinsic estimation across diverse datasets and robot platforms. CalibAll first uses EEF recognition to obtain the end-effector tracking point, then applies temporal PnP to estimate a coarse extrinsic, and finally performs extrinsic refinement to obtain an accurate result.
  • Figure 3: Detailed results of CalibAll on DREAM [lee2020camera]; the x-axis shows the number of iterations of the rendering-based optimization.
  • Figure 4: Qualitative results of EEF recognition on Franka, xArm, and UR5e. The first row shows the heatmaps obtained from feature matching. The second row visualizes the tracking point selected by maximum similarity.
  • Figure 5: Qualitative results of coarse initialization and extrinsic refinement on Franka, xArm, and UR5e. The first row presents the source RGB images. The second row shows the result rendered with the camera extrinsic obtained from the automatic coarse initialization. The last row shows the result rendered with the final camera extrinsic produced by CalibAll.
  • ...and 5 more figures