3D Foundation Models Enable Simultaneous Geometry and Pose Estimation of Grasped Objects
Weiming Zhi, Haozhan Tang, Tianyi Zhang, Matthew Johnson-Roberson
TL;DR
The paper tackles the problem of estimating both the geometry and 6D pose of objects grasped by a robot from a small set of RGB images captured by an uncalibrated external camera. It introduces the Geometry and Pose Estimation (GPE) framework, which leverages a 3D foundation model (DUSt3R) to obtain initial unscaled estimates in the camera frame and then solves a coordinate-alignment problem to recover metric scale and transform the object into the robot base frame, so that forward kinematics can map the manipulator's joint angles to points on the object. The key contributions are a unified GPE approach, a coordinate-alignment formulation that bridges foundation-model outputs with robot coordinates, and empirical validation on diverse real-world objects under limited data. The results demonstrate accurate geometry and pose estimation and show how these estimates can be used to shape robot trajectories relative to features on the held object, with implications for tool use and manipulation in cluttered or uncalibrated settings.
Abstract
Humans have the remarkable ability to use held objects as tools to interact with their environment. For this to occur, humans internally estimate how hand movements affect the object's movement. We wish to endow robots with this capability. We contribute a methodology to jointly estimate the geometry and pose of objects grasped by a robot, from RGB images captured by an external camera. Notably, our method transforms the estimated geometry into the robot's coordinate frame without requiring the extrinsic parameters of the external camera to be calibrated. Our approach leverages 3D foundation models, large models pre-trained on huge datasets for 3D vision tasks, to produce initial estimates of the in-hand object. These initial estimates lack physically correct scale and are expressed in the camera's frame. We then formulate, and efficiently solve, a coordinate-alignment problem to recover accurate scales, along with a transformation of the objects to the coordinate frame of the robot. Forward kinematics mappings can subsequently be defined from the manipulator's joint angles to specified points on the object. These mappings allow points on the held object to be estimated at arbitrary configurations, so that robot motion can be designed with respect to coordinates on the grasped objects. We empirically evaluate our approach on a robot manipulator holding a diverse set of real-world objects.
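To make the two steps above concrete, the following is a minimal sketch and not the paper's own formulation. It assumes that a set of corresponding 3D points is available both in the unscaled camera frame and in the robot base frame (the abstract does not say how correspondences are obtained), and it uses the standard closed-form similarity alignment (Umeyama's method) as a stand-in for the coordinate-alignment problem. It also assumes a rigid grasp, so that a point on the object is fixed in the end-effector frame. The function names `umeyama_alignment` and `point_at_configuration` are hypothetical.

```python
import numpy as np


def umeyama_alignment(src, dst):
    """Least-squares similarity transform: dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points (assumed given).
    Returns scale s (float), rotation R (3x3), translation t (3,).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]

    # Center both point sets.
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between the centered destination and source points.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


def point_at_configuration(T_ref, T_new, p_ref):
    """Move a point on a rigidly held object to a new robot configuration.

    T_ref, T_new: 4x4 base-to-end-effector poses (from forward kinematics)
                  at the reference and the new joint configuration.
    p_ref:        (3,) point on the object in the base frame at T_ref.
    """
    p_h = np.append(np.asarray(p_ref, dtype=float), 1.0)
    # Express the point in the end-effector frame, where it stays fixed.
    p_ee = np.linalg.solve(T_ref, p_h)
    return (T_new @ p_ee)[:3]
```

In this sketch, points estimated in the camera frame would first be mapped into the base frame with `s * R @ p + t`, and any such point could then be tracked across joint configurations by passing the corresponding end-effector poses to `point_at_configuration`.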
