3D Foundation Models Enable Simultaneous Geometry and Pose Estimation of Grasped Objects

Weiming Zhi, Haozhan Tang, Tianyi Zhang, Matthew Johnson-Roberson

TL;DR

The paper tackles the problem of estimating both the geometry and 6D pose of objects grasped by a robot from a small set of RGB images captured by an uncalibrated external camera. It introduces the Geometry and Pose Estimation (GPE) framework, which leverages 3D foundation models (DUSt3R) to obtain initial camera-frame, unscaled estimates and then solves a coordinate-alignment problem to recover metric scale and transform the object into the robot base frame, enabling forward kinematics-based mappings to points on the object. The key contributions are a unified GPE approach, a coordinate-alignment formulation that bridges foundation-model outputs with robot coordinates, and empirical validation on diverse real-world objects under limited data. The results demonstrate accurate geometry and pose estimation and show how these estimates can be used to shape robot trajectories relative to features on the held object, with implications for tool-use and manipulation in cluttered or uncalibrated settings.

Abstract

Humans have the remarkable ability to use held objects as tools to interact with their environment. For this to occur, humans internally estimate how hand movements affect the object's movement. We wish to endow robots with this capability. We contribute methodology to jointly estimate the geometry and pose of objects grasped by a robot, from RGB images captured by an external camera. Notably, our method transforms the estimated geometry into the robot's coordinate frame, while not requiring the extrinsic parameters of the external camera to be calibrated. Our approach leverages 3D foundation models, large models pre-trained on huge datasets for 3D vision tasks, to produce initial estimates of the in-hand object. These initial estimations do not have physically correct scales and are in the camera's frame. Then, we formulate, and efficiently solve, a coordinate-alignment problem to recover accurate scales, along with a transformation of the objects to the coordinate frame of the robot. Forward kinematics mappings can subsequently be defined from the manipulator's joint angles to specified points on the object. These mappings enable the estimation of points on the held object at arbitrary configurations, enabling robot motion to be designed with respect to coordinates on the grasped objects. We empirically evaluate our approach on a robot manipulator holding a diverse set of real-world objects.
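The forward-kinematics mapping described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy planar two-link arm, the helper names (`base_T_ee`, `object_point_in_base`), the convention that `a_T_b` maps coordinates in frame `b` into frame `a`, and the idea that the recovered scale is applied to the unscaled reconstruction before the rigid transform. Given a recovered scale and a fixed end-effector-to-object transform, a point on the held object can be expressed in the robot base frame at any joint configuration:

```python
import numpy as np

def rot_z(theta):
    """Homogeneous rotation about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T

def trans(x, y, z):
    """Homogeneous translation."""
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T

def base_T_ee(q, link_lengths=(0.3, 0.25)):
    """Toy forward kinematics for a hypothetical planar two-link arm.

    Returns the transform that maps end-effector coordinates to base coordinates.
    """
    l1, l2 = link_lengths
    return rot_z(q[0]) @ trans(l1, 0, 0) @ rot_z(q[1]) @ trans(l2, 0, 0)

def object_point_in_base(q, ee_T_obj, scale, p_obj_unscaled):
    """Map a point from the unscaled reconstruction into the robot base frame.

    ee_T_obj : assumed fixed while the object stays rigidly grasped.
    scale    : metric scale recovered by the alignment step (assumed given).
    """
    p_metric = scale * np.asarray(p_obj_unscaled, dtype=float)
    p_h = np.append(p_metric, 1.0)  # homogeneous coordinates
    return (base_T_ee(q) @ ee_T_obj @ p_h)[:3]

# Usage: the same object point evaluated at two joint configurations.
tip = [0.05, 0.0, 0.0]
print(object_point_in_base([0.0, 0.0], np.eye(4), 2.0, tip))
print(object_point_in_base([np.pi / 2, 0.0], np.eye(4), 2.0, tip))
```

Because the grasp is rigid, only `base_T_ee(q)` changes with the joint angles; the scale and `ee_T_obj` are estimated once and reused.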

Paper Structure

This paper contains 13 sections, 12 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: We jointly estimate the geometry and the pose of the in-hand object from RGB images taken by an uncalibrated external camera. This setup is shown on the left. This enables us to produce a reconstruction (right) of the gripper with the held hammer, and transform it to the coordinate frame of the robot.
  • Figure 2: (Top) Examples of images of a robot gripper holding a toy wrench with masked white backgrounds. A total of nine such images are input to the 3D foundation model. (Bottom-left) The structure-from-motion and multi-view stereo solution estimates a dense reconstruction and camera poses. This includes a reconstruction in a single pose and cameras at nine poses. (Bottom-right) We can instead recover the object pose estimation and dense reconstruction solution by assuming a fixed camera. This includes a camera at a single pose with reconstructions at nine poses. All cameras are illustrated as green cones. The dense reconstructions have been down-sampled to enable efficient visualisation.
  • Figure 3: We define the transformations $^{E}T_{B}$, $^{O}T_{E}$, $^{C}T_{O}$, $^{C}T_{B}$ between the robot's base, the end-effector, the object and the camera.
  • Figure 4: Each ${[^{C}T_{O}]}_{n}$ is connected to the other $^{C}T_{O}$ transforms by pre-multiplying $(H^{-1}A_{n,m}H)$ to ${[^{C}T_{O}]}_{m}$, where $m$ denotes the index of the other $^{C}T_{O}$ matrices. We can then represent $[^{C}T_{O}]_{n}$ by averaging the incoming results from all other ${[^{C}T_{O}]}$.
  • Figure 5: We minimise the distances between the dense reconstructions, transformed under the estimated $f_{i}$ and $\hat{P}_{i}^{-1}$, rendered into the camera's view in 2D. (Left) We show an example of the dense reconstruction projected by $f_{i}$, illustrated with a small selected set of sampled points in blue. The corresponding points under the $\hat{P}_{i}^{-1}$ transformation are shown in green, overlaid on a ground-truth image. The distances between the correspondences are shown as cyan lines. (Right) After optimising until convergence, distances between the illustrated selected points are minimal. The estimation projected into the camera's view almost entirely aligns and overlaps with the ground-truth image.
  • ...and 5 more figures
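
The four transformations named in Figure 3 form a closed chain. As a sketch, and assuming the convention that $^{X}T_{Y}$ maps coordinates expressed in frame $Y$ into frame $X$ (the paper may adopt a different convention), the camera-to-base transform can be written as the composition of the other three:

$$
{}^{C}T_{B} \;=\; {}^{C}T_{O}\;{}^{O}T_{E}\;{}^{E}T_{B}
$$

Under this reading, ${}^{E}T_{B}$ is available from the robot's forward kinematics at each configuration, while ${}^{O}T_{E}$ and ${}^{C}T_{B}$ stay constant across images as long as the grasp is rigid and the camera does not move.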