Visuo-Tactile Keypoint Correspondences for Object Manipulation

Jeong-Jung Kim, Doo-Yeol Koh, Chang-Hyun Kim

TL;DR

This work addresses precise object manipulation in unstructured environments by using visuo-tactile sensing to produce keypoint correspondences for pose estimation. It introduces a two-phase framework that uses dense descriptors from visuo-tactile images (via DINO and ViT features) to compute a displacement $\Delta \mathbf{P}$ and drive pose adjustments without additional training. The approach demonstrates millimeter-level accuracy in gear insertion and block alignment on a Franka robot equipped with a GelSight Mini sensor, with an average keypoint error of about $1.29$ mm, and uses iterative refinement to improve reliability. The method reduces post-grasp adjustments and deployment complexity, but it relies on rough initial alignment and predefined keypoints, which motivates future work on active alignment and automated keypoint selection for broader applicability.

Abstract

This paper presents a novel manipulation strategy that uses keypoint correspondences extracted from visuo-tactile sensor images to facilitate precise object manipulation. Our approach uses visuo-tactile feedback to guide the robot's actions for accurate object grasping and placement, eliminating the need for post-grasp adjustments and extensive training. This improves deployment efficiency and addresses the challenges of manipulation tasks in environments where object locations are not predefined. We validate the strategy through experiments that demonstrate the extraction of keypoint correspondences and their application to real-world tasks such as block alignment and gear insertion, which require millimeter-level precision. The results show an average error significantly lower than that of traditional vision-based methods, which is sufficient to achieve the target tasks.

Paper Structure

This paper contains 8 sections, 1 equation, 8 figures, 1 algorithm.

Figures (8)

  • Figure 1: Displacement estimation based on keypoint correspondences from visuo-tactile sensor images for pose adjustment in robot manipulation
  • Figure 2: Manipulation process using keypoints extracted from visuo-tactile sensor images. Corresponding points are identified between the sensor image obtained from human demonstration and the image captured during actual execution. The displacement is calculated from these correspondences, and a pose adjustment is performed based on this value.
  • Figure 3: Experimental setup. A GelSight Mini sensor, which is a visuo-tactile sensor, is attached to the end-effector of the Franka Emika Panda robot to acquire sensor data and estimate displacement.
  • Figure 4: Objects for gear insertion task. A robot picks up gears and inserts them into holes on a panel.
  • Figure 5: Example of successful keypoint correspondence. Keypoint matching associates the left corner of the object in the goal image with the left corner of the object in the captured tactile sensor image.
  • ...and 3 more figures
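
The process described in the Figure 2 caption (match a keypoint between the demonstration image and the current image, then turn the pixel offset into a displacement) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cosine-similarity nearest-neighbor matching, the sign convention (goal minus current), and the `mm_per_px` scale are all assumptions made for illustration, and the dense descriptor maps are assumed to come from some pretrained feature extractor.

```python
import numpy as np

def match_keypoint(goal_desc, cur_desc, goal_kp):
    """Find the pixel in cur_desc whose descriptor is most similar
    (cosine similarity) to the descriptor at goal_kp in goal_desc.

    goal_desc, cur_desc: (H, W, D) dense descriptor maps (assumed input).
    goal_kp: (row, col) of the predefined keypoint in the goal image.
    Returns (row, col) of the best match in the current image.
    """
    h, w, d = cur_desc.shape
    # Normalize the query descriptor taken at the goal keypoint.
    query = goal_desc[goal_kp[0], goal_kp[1]]
    query = query / (np.linalg.norm(query) + 1e-8)
    # Normalize every descriptor of the current image.
    flat = cur_desc.reshape(-1, d)
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    # Cosine similarity to all pixels; take the best one.
    sims = flat @ query
    idx = int(np.argmax(sims))
    return divmod(idx, w)  # flat index -> (row, col)

def displacement_mm(goal_kp, cur_kp, mm_per_px):
    """Offset from the current keypoint to the goal keypoint, in mm.

    mm_per_px is a hypothetical sensor calibration constant.
    """
    goal = np.asarray(goal_kp, dtype=float)
    cur = np.asarray(cur_kp, dtype=float)
    return (goal - cur) * mm_per_px
```

In an iterative scheme, the returned displacement would be applied as a pose adjustment and the match repeated until the offset falls below a tolerance; how the image-plane offset maps to a robot motion depends on the sensor mounting and is not covered here.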