AGILE: Approach-based Grasp Inference Learned from Element Decomposition

MohammadHossein Koosheshi, Hamed Hosseini, Mehdi Tale Masouleh, Ahmad Kalhor, Mohammad Reza Hairi Yazdi

TL;DR

AGILE tackles robotic grasping by leveraging hand-object approach information and explicit object element decomposition. It proposes a two-stage pipeline with a Mask R-CNN-based element decomposer and an approach-conditioned grasp detector that regresses the grasp rectangle $(x,y,\theta,w)$, trained on a novel Coppeliasim dataset with 10 objects and element masks. In simulation, the method achieves 90% success on seen objects and 78% on unseen objects, and sim-to-real adaptation yields about 70% physical grasp success on a Delta parallel robot with a 2-finger gripper, demonstrating notable generalization and a path toward real-world deployment. The work contributes a public dataset and a practical pipeline for approach-aware, element-based grasp inference, while identifying improvements such as multi-view sensing and larger object sets to close the sim-to-real gap.

Abstract

Humans, who are adept at grasping, take hand-object positioning information into account when they grasp objects. This work proposes a method that enables a robot manipulator to do the same, grasping an object in the optimal way given how the gripper has approached it. Built on deep learning, the proposed method consists of two main stages. To generalize to unseen objects, the proposed Approach-based Grasp Inference includes an element decomposition stage that splits an object into its main parts, each with one or more annotated grasps for a particular approach of the gripper. A grasp detection network then uses the elements decomposed by Mask R-CNN, together with information on the gripper's approach, to detect the element the gripper has approached and the optimal grasp. To train the networks, the study introduces a robotic grasping dataset collected in the Coppeliasim simulation environment. The dataset comprises 10 different objects with annotated element decomposition masks and grasp rectangles. The proposed method achieves a 90% grasp success rate on seen objects and 78% on unseen objects in the Coppeliasim simulation environment. Lastly, simulation-to-reality domain adaptation is performed by applying transformations to the training set collected in simulation and augmenting the dataset, which yields a 70% physical grasp success rate using a Delta parallel robot and a 2-fingered gripper.
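
The grasp rectangle mentioned above is parameterized as $(x,y,\theta,w)$. As a minimal sketch of how such a tuple could be turned into drawable corner points, the following assumes that $(x,y)$ is the rectangle center in pixels, that $\theta$ is the in-plane rotation in radians, and that $w$ is the gripper opening. The function name `grasp_rectangle_corners` and the fixed jaw size `h` are hypothetical and not taken from the paper.

```python
import math
from typing import List, Tuple


def grasp_rectangle_corners(
    x: float, y: float, theta: float, w: float, h: float = 20.0
) -> List[Tuple[float, float]]:
    """Return the four corners of an oriented grasp rectangle.

    Assumptions (not from the paper): (x, y) is the center in pixels,
    theta is the rotation in radians of the opening axis relative to the
    image x-axis, w is the gripper opening, and h is a fixed jaw size.
    """
    c, s = math.cos(theta), math.sin(theta)
    half_w, half_h = w / 2.0, h / 2.0
    # Corners in the rectangle's own frame, listed in order around the
    # perimeter.
    local = [
        (-half_w, -half_h),
        (half_w, -half_h),
        (half_w, half_h),
        (-half_w, half_h),
    ]
    # Rotate each corner by theta, then translate to the center.
    return [(x + dx * c - dy * s, y + dx * s + dy * c) for dx, dy in local]


if __name__ == "__main__":
    for corner in grasp_rectangle_corners(120.0, 80.0, math.pi / 6, 40.0):
        print(corner)
```

Keeping the jaw size as a separate default argument reflects that the four regressed values alone do not fix the rectangle's second side length.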

Paper Structure

This paper contains 7 sections, 2 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Overview of the proposed Approach-based Grasp Inference method, which decomposes the object into its elements. Starting from the top view of the object, Mask R-CNN decomposes the object into its primitive elements. The decomposed parts are fed into a grasp detection network, which also takes the gripper's approach information as input, selects the best graspable part, and detects the optimal grasp for that approach.
  • Figure 2: Collecting the approach-based grasping dataset in the simulation environment. (a) shows how the vision sensors and the object are placed in the environment before the approach has been made, with the image of the object recorded by the top-view camera in the bottom row. The top row of (b) shows the scene once the approach has been made, and the bottom row shows the image of the gripper's approach recorded by the isometric camera shown in (a). The top row of (c) illustrates the gripper in the grasping pose, with the recorded grasp in the bottom row.
  • Figure 3: Samples from the training set of the proposed grasping dataset with grasp rectangle annotations. One can see that different approaches have led to different grasps of the objects.
  • Figure 4: Samples of object images with augmentation transformations applied and their corresponding element decomposition masks.
  • Figure 5: Grasp detection CNN. The elements decomposed by Mask R-CNN and the image representing the approach of the gripper are fed into the network to find the best part to grasp. The network processes the feature maps to select the optimal decomposed element.
  • ...and 3 more figures
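
Figure 4 pairs augmented object images with their element decomposition masks, which means any geometric transformation has to be applied identically to the image, the mask, and the grasp annotation. The source does not list the specific transformations used, so the sketch below picks a horizontal flip purely as an illustration. The function name, the nested-list image layout, and the angle convention (radians, measured from the image x-axis) are all assumptions.

```python
import math
from typing import List, Tuple

Image = List[List[int]]
Grasp = Tuple[float, float, float, float]  # (x, y, theta, w)


def hflip_sample(
    image: Image, mask: Image, grasp: Grasp
) -> Tuple[Image, Image, Grasp]:
    """Horizontally flip an image, its element mask, and its grasp label.

    Illustrative only: the flip is an assumed augmentation, not one
    confirmed by the paper.
    """
    width = len(image[0])
    # Reverse every row so that image and mask stay pixel-aligned.
    flipped_image = [row[::-1] for row in image]
    flipped_mask = [row[::-1] for row in mask]

    x, y, theta, w = grasp
    # Mirror the center's column index; the row index is unchanged.
    new_x = (width - 1) - x
    # Mirroring about a vertical axis negates the angle. A grasp axis is
    # symmetric under 180 degree rotation, so wrap into [-pi/2, pi/2).
    new_theta = ((-theta + math.pi / 2) % math.pi) - math.pi / 2
    return flipped_image, flipped_mask, (new_x, y, new_theta, w)


if __name__ == "__main__":
    img = [[1, 2, 3], [4, 5, 6]]
    msk = [[0, 1, 1], [0, 0, 1]]
    print(hflip_sample(img, msk, (0.0, 1.0, 0.3, 10.0)))
```

The opening `w` is left untouched because a flip preserves lengths; a scaling or perspective transformation would need to update it as well.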