Visuo-Tactile Keypoint Correspondences for Object Manipulation
Jeong-Jung Kim, Doo-Yeol Koh, Chang-Hyun Kim
TL;DR
This work addresses precise object manipulation in unstructured environments by using visuo-tactile sensing to produce keypoint correspondences for pose estimation. It introduces a two-phase framework that extracts dense descriptors from visuo-tactile images (via DINO and ViT features), computes a displacement $\Delta \mathbf{P}$ from the matched keypoints, and uses that displacement to drive pose adjustments without additional training. The approach achieves millimeter-level accuracy in gear insertion and block alignment on a Franka robot equipped with a GelSight Mini sensor, with an average keypoint error of about 1.29 mm, and uses iterative refinement to improve reliability. The method reduces post-grasp adjustments and deployment complexity, but it relies on rough initial alignment and predefined keypoints, which motivates future work on active alignment and automated keypoint selection.
Abstract
This paper presents a novel manipulation strategy that uses keypoint correspondences extracted from visuo-tactile sensor images to facilitate precise object manipulation. Our approach uses visuo-tactile feedback to guide the robot's actions for accurate object grasping and placement, eliminating the need for post-grasp adjustments and extensive training. The method improves deployment efficiency and addresses the challenges of manipulation tasks in environments where object locations are not predefined. We validate the strategy through experiments that demonstrate the extraction of keypoint correspondences and their application to real-world tasks, such as block alignment and gear insertion, that require millimeter-level precision. The results show an average error significantly lower than that of traditional vision-based methods and low enough to accomplish the target tasks.
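To make the idea of keypoint correspondences concrete, the following is a minimal sketch rather than the authors' implementation. It assumes dense descriptors are already available as arrays of shape (H, W, C), for example from a pretrained ViT, matches each predefined reference keypoint to its nearest neighbor by cosine similarity, and averages the per-keypoint offsets into a planar displacement. The function names `match_keypoints` and `mean_displacement` and the `mm_per_pixel` scale are hypothetical and chosen only for illustration.

```python
import numpy as np

def match_keypoints(ref_desc, cur_desc, ref_kps):
    """For each reference keypoint (row, col), find the best-matching pixel
    in the current descriptor map by cosine similarity (assumed metric)."""
    h, w, c = cur_desc.shape
    # Flatten the current map to (H*W, C) and L2-normalize each descriptor.
    cur_flat = cur_desc.reshape(-1, c)
    cur_flat = cur_flat / (np.linalg.norm(cur_flat, axis=1, keepdims=True) + 1e-8)
    matches = []
    for r, col in ref_kps:
        q = ref_desc[r, col]
        q = q / (np.linalg.norm(q) + 1e-8)
        # Highest cosine similarity gives the corresponding location.
        idx = int(np.argmax(cur_flat @ q))
        matches.append(divmod(idx, w))  # flat index -> (row, col)
    return np.array(matches)

def mean_displacement(ref_kps, cur_kps, mm_per_pixel):
    """Average (current - reference) keypoint offset, scaled to millimeters.
    mm_per_pixel is a hypothetical sensor calibration constant."""
    offsets = np.asarray(cur_kps, dtype=float) - np.asarray(ref_kps, dtype=float)
    return offsets.mean(axis=0) * mm_per_pixel

if __name__ == "__main__":
    # Synthetic check: the current map is the reference shifted by (2, 3) pixels.
    rng = np.random.default_rng(0)
    ref = rng.normal(size=(12, 12, 16))
    cur = np.roll(ref, shift=(2, 3), axis=(0, 1))
    kps = np.array([[3, 4], [5, 2], [6, 6]])
    matched = match_keypoints(ref, cur, kps)
    print(mean_displacement(kps, matched, mm_per_pixel=0.1))
```

In a setting like the one described, an offset of this kind would be recomputed after each motion so that the correction can be refined iteratively. The mapping from image-plane displacement to a robot pose adjustment depends on sensor calibration and is not shown here.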
