Stable Offline Hand-Eye Calibration for any Robot with Just One Mark
Sicheng Xie, Lingchen Meng, Zhiying Du, Shuyuan Tu, Haidong Cao, Jiaqi Leng, Zuxuan Wu, Yu-Gang Jiang
TL;DR
CalibAll tackles the lack of camera extrinsics in off-the-shelf robotic datasets by proposing a training-free, offline hand-eye calibration method that uses a single annotated mark. It combines cross-robot mark localization based on vision foundation models, temporal PnP for coarse initialization, and differentiable rendering to refine the extrinsic $T_b^c$, achieving robust performance across three robot platforms. The approach yields high-precision extrinsics (e.g., on DREAM-real, AUC $=97.64$, ADD $=0.008$ m) and provides useful auxiliary annotations such as depth maps, per-link masks, and 2D end-effector (EEF) trajectories to support downstream tasks. This enables improved imitation-learning pipelines with camera-space actions and broader pretraining opportunities without robot-specific calibration data.
Abstract
Imitation learning has achieved remarkable success in a variety of robotic tasks by learning a mapping from camera-space observations to robot-space actions. Recent work indicates that robot-to-camera transformation information (i.e., camera extrinsics) benefits the learning process and produces better results. However, camera extrinsics are often unavailable, and existing estimation methods usually suffer from local minima and poor generalization. In this paper, we present CalibAll, a simple yet effective method that requires only a single mark and performs training-free, stable, and accurate camera extrinsic estimation across diverse robots and datasets through a coarse-to-fine calibration pipeline. In particular, we annotate a single mark on an end-effector (EEF) and leverage the correspondence ability that emerges from vision foundation models (VFMs) to automatically localize the corresponding mark across robots in diverse datasets. Using this mark, together with point tracking and the 3D EEF trajectory, we obtain a coarse camera extrinsic estimate via temporal Perspective-n-Point (PnP). This estimate is further refined through a rendering-based optimization that aligns rendered and ground-truth masks, yielding accurate and stable camera extrinsics. Experimental results demonstrate that our method outperforms state-of-the-art approaches, showing strong robustness and general effectiveness across three robot platforms. It also produces useful auxiliary annotations such as depth maps, link-wise masks, and end-effector 2D trajectories, which can further support downstream tasks.
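The coarse stage can be illustrated with a minimal sketch. It assumes that the 3D mark positions over time are already expressed in the robot base frame, that the tracked 2D mark positions are in pixels, and that the intrinsics $K$ are known. The solver is a plain Direct Linear Transform (DLT), swapped in here for illustration; it is not necessarily the PnP solver used by CalibAll, and the function names `project` and `coarse_extrinsic_dlt` are hypothetical.

```python
import numpy as np


def project(points_base, T_b_c, K):
    """Project Nx3 base-frame points to pixels with extrinsic T_b^c (4x4) and intrinsics K (3x3)."""
    pts_cam = points_base @ T_b_c[:3, :3].T + T_b_c[:3, 3]
    uv = pts_cam @ K.T
    return uv[:, :2] / uv[:, 2:3]


def coarse_extrinsic_dlt(points_base, pixels, K):
    """Estimate T_b^c from 3D-2D correspondences collected over time (DLT sketch).

    points_base: Nx3 mark positions in the base frame, one per frame.
    pixels:      Nx2 tracked mark positions in the image, one per frame.
    Needs at least 6 correspondences that are not all coplanar.
    """
    points_base = np.asarray(points_base, dtype=float)
    pixels = np.asarray(pixels, dtype=float)
    n = len(points_base)
    if n < 6:
        raise ValueError("DLT needs at least 6 correspondences")

    ones = np.ones((n, 1))
    # Remove the intrinsics: normalized image coordinates (x, y).
    xy = (np.linalg.inv(K) @ np.hstack([pixels, ones]).T).T
    xy = xy[:, :2] / xy[:, 2:3]

    # Each correspondence gives two linear equations in the 12 entries of [R | t].
    Xh = np.hstack([points_base, ones])
    A = np.zeros((2 * n, 12))
    A[0::2, 0:4] = Xh
    A[0::2, 8:12] = -xy[:, 0:1] * Xh
    A[1::2, 4:8] = Xh
    A[1::2, 8:12] = -xy[:, 1:2] * Xh

    # The solution (up to scale and sign) is the right-singular vector
    # associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)

    # Project the 3x3 block onto the nearest rotation and undo the scale.
    U, S, Vt_r = np.linalg.svd(P[:, :3])
    R = U @ Vt_r
    t = P[:, 3] / S.mean()
    if np.linalg.det(R) < 0:  # resolve the sign ambiguity of the null vector
        R, t = -R, -t

    T_b_c = np.eye(4)
    T_b_c[:3, :3] = R
    T_b_c[:3, 3] = t
    return T_b_c
```

In this sketch, the temporal aspect is simply that each frame contributes one correspondence, so a single mark observed along a sufficiently varied 3D trajectory supplies enough constraints. With noisy tracks, a robust solver (for example, RANSAC around a PnP routine) would be a more reasonable choice, and the result would serve only as an initialization for the rendering-based refinement.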
