Grasping Deformable Objects via Reinforcement Learning with Cross-Modal Attention to Visuo-Tactile Inputs

Yonghyun Lee, Sungeun Hong, Min-gu Kim, Gyeonghwan Kim, Changjoo Nam

TL;DR

This work tackles the robust grasping of deformable objects with a simple gripper by fusing visuo-tactile sensing in a DRL framework. It introduces a cross-modal Spatio-Channel Attention (CSCA) encoder that selectively fuses visual segmentation masks and tactile inputs, and trains the encoder end-to-end with Soft Actor-Critic and data augmentation. Empirical results in a PyBullet+TACTO simulation show that the proposed DRQ-CMA method outperforms early/late fusion baselines and single-modality approaches, and that it generalizes to unseen motions and objects. The approach promises safer, more reliable manipulation of soft objects using relatively simple robotic hardware, and points toward future sim-to-real transfer and physical validation.

Abstract

We consider the problem of grasping deformable objects with soft shells using a robotic gripper. Such objects have a center of mass that changes dynamically, and they are fragile and prone to bursting. It is therefore difficult for robots to generate control inputs that neither drop nor break the object during manipulation tasks. Multi-modal sensing data can help in understanding the grasping state through global information (e.g., shape, pose) from visual data and local information around the contact (e.g., pressure) from tactile data. Although the two modalities carry complementary information that is beneficial to use together, fusing them is difficult owing to their different properties. We propose a method based on deep reinforcement learning (DRL) that generates control inputs for a simple gripper from visuo-tactile sensing information. Our method employs a cross-modal attention module in the encoder network and trains it in a self-supervised manner using the loss function of the RL agent. With this multi-modal fusion, the proposed method learns the representation for the DRL agent from the visuo-tactile sensory data. The experimental results show that cross-modal attention outperforms early and late data fusion methods across different environments, including unseen robot motions and objects.
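
To make the idea of cross-modal attention concrete, the following is a minimal NumPy sketch of generic scaled dot-product cross-attention, in which visual features query tactile features. It is not the paper's CSCA module. The token counts, the feature width D, the random projection matrices, and the names cross_attention, visual_tokens, and tactile_tokens are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Generic scaled dot-product cross-attention (illustrative only).

    query_feats:   (Nq, D) features of the modality that asks (here: visual).
    context_feats: (Nc, D) features of the modality that answers (here: tactile).
    Returns the attended context, shape (Nq, D), and the weights, shape (Nq, Nc).
    """
    q = query_feats @ w_q
    k = context_feats @ w_k
    v = context_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each query row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
D = 8  # hypothetical feature width
visual_tokens = rng.normal(size=(16, D))   # hypothetical visual feature tokens
tactile_tokens = rng.normal(size=(4, D))   # hypothetical tactile feature tokens
# Random stand-ins for learned projection matrices.
w_q, w_k, w_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))

tactile_context, attn = cross_attention(visual_tokens, tactile_tokens, w_q, w_k, w_v)
# One simple way to combine the result with the original visual features.
fused = np.concatenate([visual_tokens, tactile_context], axis=-1)
print(fused.shape, attn.shape)
```

In a learned encoder the projection matrices would be trained, here through the RL agent's loss as the abstract describes, so that each visual token weights the tactile tokens most relevant to it.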

Paper Structure

This paper contains 13 sections, 11 equations, 7 figures.

Figures (7)

  • Figure 1: The grasping task of deformable objects. (Left) A two-jaw robotic gripper attached to a 6-DOF arm. While the arm moves, the goal is to generate the control input to the gripper so that it grasps the object stably and safely without dropping or breaking it. (Upper right) A concatenated image of two tactile images obtained from the sensors attached to the inside of the fingertips. The tactile data provide local information around the contact area. (Lower right) A segmentation mask from an RGB image. The visual data can provide global information about grasping.
  • Figure 2: The architecture of the cross-modal attention module proposed in zhang_spatio-channel_2022. The module consists of the SCA block, which extracts global feature correlations of the multi-modal data, and the CFA block, which integrates complementary features.
  • Figure 3: An example of the grasping task. (Left) The arm is controlled to move the gripper down to the object. (Mid) The agent begins to control the gripper to grasp the object. (Right) The arm is controlled to lift its gripper.
  • Figure 4: (Left) The overall structure of the cross-attention DRL framework (DRQ-CMA). The encoder learns the representation for the RL agent from the visuo-tactile information by selectively focusing on relevant data. (Right) The encoders used in DRQ-EF and DRQ-LF without the attention mechanism.
  • Figure 5: The deformable objects used in training and testing. The two left objects are used in training. In tests, all objects are used, including the four unseen objects. (Left) A cylindrical object and a hexahedron, which weigh 0.5 kg. (Mid) The sizes of the objects are 0.8 times those of the left ones. Weights are 0.25 kg for both. (Right) The sizes are 1.2 times those of the left ones. Their weights are 2 kg.
  • ...and 2 more figures
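
The caption of Figure 4 contrasts the attention-based encoder with the encoders of DRQ-EF and DRQ-LF, which are read here as the early-fusion and late-fusion baselines (an assumption based on the TL;DR). The NumPy sketch below illustrates only that general distinction. The linear-plus-tanh stand-in encoder, the image shapes, and the feature sizes are hypothetical and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(in_dim, out_dim):
    # Stand-in for a learned encoder: one random linear layer followed by tanh.
    w = rng.normal(size=(in_dim, out_dim)) * 0.01
    def encode(x):
        return np.tanh(x.reshape(-1) @ w)
    return encode

# Hypothetical inputs in (channels, height, width) layout.
mask = rng.random((1, 32, 32))     # single-channel segmentation mask
tactile = rng.random((3, 32, 32))  # tactile image

# Early fusion: stack the raw inputs channel-wise, then encode them jointly.
stacked = np.concatenate([mask, tactile], axis=0)  # shape (4, 32, 32)
early_encoder = make_encoder(stacked.size, 50)
z_early = early_encoder(stacked)

# Late fusion: encode each modality separately, then concatenate the features.
visual_encoder = make_encoder(mask.size, 25)
tactile_encoder = make_encoder(tactile.size, 25)
z_late = np.concatenate([visual_encoder(mask), tactile_encoder(tactile)])

print(z_early.shape, z_late.shape)
```

Early fusion as sketched requires both inputs to share a spatial size, while late fusion lets each modality keep its own encoder. Neither variant weights one modality's features by the other's, which is the role the attention module plays in DRQ-CMA.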