Unifying Scene Representation and Hand-Eye Calibration with 3D Foundation Models

Weiming Zhi; Haozhan Tang; Tianyi Zhang; Matthew Johnson-Roberson

TL;DR

This paper introduces Joint Calibration and Representation (JCR), a method that uses 3D foundation models to simultaneously calibrate a manipulator-mounted RGB camera to the end-effector and build a physically scaled 3D environment representation in the robot base frame from a small set of images. By extracting dense, marker-free correspondences from foundation models and solving a hand-eye calibration problem with a scale recovery step, JCR yields an accurate camera-to-end-effector transformation $T_c^e$ and scale factor $\lambda$, enabling collision-aware planning. Empirical results show that JCR is more image-efficient than COLMAP for calibration, recovers scale to within a few percent even with few views, and constructs rich maps that capture occupancy, segmentation, and colour. The approach is practical for low-cost robotic systems and sets the stage for handling dynamic scenes and uncertainty-aware representations in future work.
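The summary above does not reproduce the paper's equations, so the following is only a generic sketch of hand-eye calibration with an unknown scale, not the authors' implementation. It assumes the textbook formulation $A_i X = X B_i$, where $A_i$ are metric relative end-effector motions, $B_i$ are relative camera motions whose translations are in arbitrary reconstruction units, $X$ plays the role of $T_c^e$, and $\lambda$ rescales the camera translations. All function names are hypothetical. The rotation is fitted from axis-angle vectors, and the translation and $\lambda$ are then found by linear least squares.

```python
import numpy as np


def rotvec(R):
    """Axis-angle vector of a rotation matrix (angle assumed in (0, pi))."""
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return angle * axis / (2.0 * np.sin(angle))


def rot_from_vec(v):
    """Rotation matrix from a nonzero axis-angle vector (Rodrigues' formula)."""
    angle = np.linalg.norm(v)
    k = v / angle
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * K @ K


def to_T(R, t):
    """Assemble a 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T


def hand_eye_with_scale(A_list, B_list):
    """Solve A_i X = X B_i for X = (R_X, t_X) and the translation scale lam."""
    # Rotation: axis-angle vectors satisfy alpha_i = R_X beta_i (Kabsch fit).
    alphas = np.array([rotvec(A[:3, :3]) for A in A_list])
    betas = np.array([rotvec(B[:3, :3]) for B in B_list])
    U, _, Vt = np.linalg.svd(betas.T @ alphas)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R_X = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Translation and scale: (R_A - I) t_X - lam * R_X t_B = -t_A, stacked.
    rows = [np.hstack([A[:3, :3] - np.eye(3), -(R_X @ B[:3, 3])[:, None]])
            for A, B in zip(A_list, B_list)]
    rhs = np.concatenate([-A[:3, 3] for A in A_list])
    sol, *_ = np.linalg.lstsq(np.vstack(rows), rhs, rcond=None)
    return R_X, sol[:3], sol[3]


def simulate_motions(X, lam, motions):
    """Synthetic (A_i, B_i) pairs for a known X and scale, for checking only."""
    A_list = [to_T(rot_from_vec(np.asarray(w, float)), v) for w, v in motions]
    B_list = []
    for A in A_list:
        B = np.linalg.inv(X) @ A @ X  # metric relative camera motion
        B[:3, 3] /= lam               # reconstruction units differ by lam
        B_list.append(B)
    return A_list, B_list
```

At least two motions with non-parallel rotation axes are needed for the translation to be determined. The sketch also assumes noise-free inputs with rotation angles away from 0 and pi; real data would call for a more robust solver.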

Abstract

Representing the environment is a central challenge in robotics, and is essential for effective decision-making. Traditionally, before capturing images with a manipulator-mounted camera, users need to calibrate the camera using a specific external marker, such as a checkerboard or AprilTag. However, recent advances in computer vision have led to the development of \emph{3D foundation models}. These are large, pre-trained neural networks that can establish fast and accurate multi-view correspondences with very few images, even in the absence of rich visual features. This paper advocates for the integration of 3D foundation models into scene representation approaches for robotic systems equipped with manipulator-mounted RGB cameras. Specifically, we propose the Joint Calibration and Representation (JCR) method. JCR uses RGB images, captured by a manipulator-mounted camera, to simultaneously construct an environmental representation and calibrate the camera relative to the robot's end-effector, in the absence of specific calibration markers. The resulting 3D environment representation is aligned with the robot's coordinate frame and maintains physically accurate scales. We demonstrate that JCR can build effective scene representations using a low-cost RGB camera attached to a manipulator, without prior calibration.

Paper Structure (12 sections, 14 equations, 7 figures, 1 table, 1 algorithm)

Introduction
Related Work
Preliminaries: 3D Foundation Models for Dense Reconstruction
Joint Calibration and Representation
Overview
Calibration With Foundation Model Outputs
Map Construction with Foundation Model Outputs
Empirical Evaluations
Hand-Eye Calibration with JCR
Scale Recovery with JCR
Constructing Representations with JCR
Conclusions and Future Work

Figures (7)

Figure 1: Our presented JCR method jointly calibrates the camera and builds environment representations (which can capture occupancy, colour, and segmentation classes) from RGB images captured on a manipulator-mounted camera. Crucially, no external markers, such as AprilTags, are required for the calibration. The representations are of true scale and in the coordinate frame of the robot.
Figure 2: An example of end-effector positions (shown as blue crosses) and corresponding camera poses (given by cones, with cone bases indicating camera orientation), before and after calibration and rescaling. The camera poses from the foundation model (subfigure a) neither correspond to the physical scale nor align with the end-effector. This is resolved by solving Eq. \ref{eq:srp} (subfigure b).
Figure 3: We attach an inexpensive and small RGB camera to our manipulator and take images at different end-effector poses. The transformation from the camera to the end-effector is not known in advance and is solved by our calibration.
Figure 4: Percentage error between true heights and recovered heights.
Figure 5: Top Row: Examples of images taken by our manipulator-mounted camera. Bottom Row: Environment representations built with JCR. We visualize them by sampling points in regions of high predicted occupancy and assigning the colours predicted by the representation at these points. These representations are in the coordinate frame of the robot with physically accurate scales.
...and 2 more figures
