The Feeling of Success: Does Touch Sensing Help Predict Grasp Outcomes?

Roberto Calandra, Andrew Owens, Manu Upadhyaya, Wenzhen Yuan, Justin Lin, Edward H. Adelson, Sergey Levine

TL;DR

The paper tackles predicting grasp outcomes for a two-finger robot gripper by fusing high-resolution GelSight tactile sensing with RGB vision in an end-to-end learning framework. It introduces a late-fusion CNN that computes the grasp-success probability $y=f(x)$ from inputs $x=(I_{RGB}, I_{GelSightL}, I_{GelSightR})$, incorporating temporal cues such as the snapshots $I_{T_a}$ and $I_{T_b}$ and the GelSight difference $I_{T_b}-I_{T_a}$. On a dataset of 9,269 grasps across 106 objects, tactile and visuo-tactile models outperform vision-only baselines, with the full visuo-tactile model achieving the best predictive accuracy. In real-world grasping on 12 unseen objects, the visuo-tactile model achieved 94% success compared to 80% for vision-only, demonstrating practical benefits for grasp planning; the study also discusses limitations and future directions for more efficient tactile integration.
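
The GelSight difference $I_{T_b}-I_{T_a}$ mentioned above is a per-pixel subtraction of two tactile snapshots. The following is a minimal sketch, assuming each frame is an equal-sized grid of integer intensities; the function name `frame_difference` and the toy frames are illustrative and not taken from the paper.

```python
def frame_difference(frame_b, frame_a):
    """Return the per-pixel difference frame_b - frame_a.

    Both frames are assumed (for illustration) to be lists of rows of
    numbers with identical shapes.
    """
    same_rows = len(frame_a) == len(frame_b)
    same_cols = all(len(ra) == len(rb) for ra, rb in zip(frame_a, frame_b))
    if not (same_rows and same_cols):
        raise ValueError("frames must have the same shape")
    return [[pb - pa for pb, pa in zip(rb, ra)]
            for rb, ra in zip(frame_b, frame_a)]


# Toy 2x2 frames standing in for GelSight snapshots at T_a and T_b.
frame_ta = [[10, 10], [10, 10]]
frame_tb = [[10, 40], [10, 90]]

diff = frame_difference(frame_tb, frame_ta)
print(diff)
```

Pixels that do not change between the two snapshots map to zero, so the difference emphasizes where the gel surface changed between $T_a$ and $T_b$.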

Abstract

A successful grasp requires careful balancing of the contact forces. Deducing whether a particular grasp will be successful from indirect measurements, such as vision, is therefore quite challenging, and direct sensing of contacts through touch sensing provides an appealing avenue toward more successful and consistent robotic grasping. However, in order to fully evaluate the value of touch sensing for grasp outcome prediction, we must understand how touch sensing can influence outcome prediction accuracy when combined with other modalities. Doing so using conventional model-based techniques is exceptionally difficult. In this work, we investigate the question of whether touch sensing aids in predicting grasp outcomes within a multimodal sensing framework that combines vision and touch. To that end, we collected more than 9,000 grasping trials using a two-finger gripper equipped with GelSight high-resolution tactile sensors on each finger, and evaluated visuo-tactile deep neural network models to directly predict grasp outcomes from either modality individually, and from both modalities together. Our experimental results indicate that incorporating tactile readings substantially improves grasping performance.

Paper Structure

This paper contains 16 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The use of tactile sensors can greatly improve robot grasping capabilities. In our experiments, we used two GelSight sensors mounted on a parallel jaw gripper.
  • Figure 2: Examples of raw tactile data collected by the GelSight for different training objects.
  • Figure 3: Diagram of our visual-tactile multi-modal model. At grasping time (before attempting to lift the object), the RGB images from the front camera and the GelSight sensor images are fed to a deep neural network, which predicts whether the grasp will be successful. In the network, the data from each sensor is first passed through a convolutional neural network, and the resulting features are concatenated as the input to a fully-connected network.
  • Figure 4: Chronology of a data collection trial, with the various grasping phases, and the three snapshot points $T_a, T_b, T_c$.
  • Figure 5: Examples of training objects. Overall, 106 objects were used to collect 9269 grasps.
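
The late-fusion structure described in the Figure 3 caption (one encoder per sensor, concatenated features, then a prediction head) can be sketched in a few lines. This is only an illustration under strong simplifying assumptions: each encoder is a hypothetical random linear map with a ReLU standing in for a convolutional network, the head is a single logistic unit rather than a fully-connected network, and all names, sizes, and weights are made up for the example.

```python
import math
import random


def make_encoder(in_dim, feat_dim, seed):
    """Build a stand-in encoder (NOT a CNN): ReLU of a fixed random linear map."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(feat_dim)]

    def encode(pixels):
        return [max(0.0, sum(w * p for w, p in zip(row, pixels)))
                for row in weights]

    return encode


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def late_fusion_predict(images, encoders, head_weights, head_bias):
    """Encode each sensor separately, concatenate features, apply a logistic head."""
    features = []
    for encode, image in zip(encoders, images):
        features.extend(encode(image))
    if len(features) != len(head_weights):
        raise ValueError("head size does not match concatenated feature size")
    logit = sum(w * f for w, f in zip(head_weights, features)) + head_bias
    return sigmoid(logit)


# Illustrative sizes: 16 "pixels" per flattened image, 4 features per sensor.
IN_DIM, FEAT_DIM = 16, 4
encoders = [make_encoder(IN_DIM, FEAT_DIM, seed) for seed in (0, 1, 2)]

rng = random.Random(42)
head_weights = [rng.uniform(-1.0, 1.0) for _ in range(3 * FEAT_DIM)]
rgb = [rng.random() for _ in range(IN_DIM)]
gel_left = [rng.random() for _ in range(IN_DIM)]
gel_right = [rng.random() for _ in range(IN_DIM)]

p_success = late_fusion_predict((rgb, gel_left, gel_right),
                                encoders, head_weights, head_bias=0.0)
print(f"predicted grasp-success probability: {p_success:.3f}")
```

Keeping the encoders separate until the concatenation step is what makes this "late" fusion: a modality can be dropped or swapped by changing the list of encoders and the head size, without touching the other branches.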