
Improving the Successful Robotic Grasp Detection Using Convolutional Neural Networks

Hamed Hosseini, Mehdi Tale Masouleh, Ahmad Kalhor

TL;DR

This work tackles real-time robotic grasp detection by regressing from RGB-D input to a rectangle grasp representation $g=\{x,y,\theta,w,h\}$, or to its orientation-encoded form in which $\theta$ is replaced by $\sin\theta$ and $\cos\theta$. A two-stage CNN pipeline built on transfer-learned feature extractors outputs a 6D vector $\hat{t}=(\hat{x},\hat{y},\widehat{\sin\theta},\widehat{\cos\theta},\hat{w},\hat{h})$, taking 4-channel RGB-D input and using normalization strategies to stabilize training. Key contributions include data augmentation (rotation and zoom), output normalization, and the finding that depth information significantly boosts Jaccard-based grasp accuracy; among the tested backbones, AlexNet provided the best real-time performance. Evaluations on the Cornell dataset show competitive grasp detection accuracy, and the results point to practical deployment opportunities with ROS/Gazebo and potential integration of force sensing. The approach advances robust, fast grasp perception for unseen objects in robotic manipulation tasks.
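The orientation encoding described above can be illustrated with a minimal sketch. The function names `encode_grasp` and `decode_grasp` are hypothetical and not taken from the paper; the sketch only assumes that $\theta$ is given in radians and that the angle is recovered from the predicted sine and cosine with `atan2`.

```python
import math

def encode_grasp(x, y, theta, w, h):
    """Map a 5D grasp (x, y, theta, w, h) to a 6D target (x, y, sin, cos, w, h)."""
    return (x, y, math.sin(theta), math.cos(theta), w, h)

def decode_grasp(t):
    """Recover (x, y, theta, w, h) from a 6D vector.

    atan2 only uses the ratio and signs of its arguments, so the predicted
    sine and cosine do not need to lie exactly on the unit circle.
    """
    x, y, s, c, w, h = t
    return (x, y, math.atan2(s, c), w, h)
```

Splitting the angle into two bounded components avoids the discontinuity that a raw angle has where it wraps around, which is one common motivation for this kind of encoding.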

Abstract

Robotic grasping should be carried out in real time with adequate accuracy. Perception is the first significant step in this procedure. This paper proposes an improved pipeline model that detects a grasp as a rectangle representation for both seen and unseen objects. This lets the robot start its control procedures closer to the proper part of the object. The main idea consists of pre-processing, output normalization, and data augmentation, which together improve accuracy by 4.3 percent without slowing the system down. In addition, a comparison is conducted over different pre-trained models such as AlexNet, ResNet, and VGG19, which are among the best-known feature extractors for image processing in object detection. Although AlexNet is less complex than the others, it outperformed them, which benefits real-time operation.
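When images are augmented by rotation and zoom, the grasp label has to be transformed consistently with the image. The following is a minimal sketch of that bookkeeping, not the authors' implementation: the helper `augment_grasp` is hypothetical, and it assumes that the rotation angle is measured in the same sense as $\theta$, that both transforms are applied about a chosen center point, and that the image itself is warped with the same rotation and scale.

```python
import math

def augment_grasp(grasp, center, angle, zoom):
    """Rotate by `angle` (radians) and scale by `zoom` about `center`.

    `grasp` is (x, y, theta, w, h). The grasp center is moved by the same
    rotation and scaling as the image, the orientation is shifted by the
    rotation angle, and the rectangle size is scaled by the zoom factor.
    """
    x, y, theta, w, h = grasp
    cx, cy = center
    dx, dy = x - cx, y - cy
    ca, sa = math.cos(angle), math.sin(angle)
    new_x = cx + zoom * (ca * dx - sa * dy)
    new_y = cy + zoom * (sa * dx + ca * dy)
    return (new_x, new_y, theta + angle, zoom * w, zoom * h)
```

For example, `augment_grasp((2, 1, 0, 4, 2), (1, 1), math.pi / 2, 2)` moves the grasp center to approximately `(1, 3)`, turns the rectangle by a quarter turn, and doubles its width and height.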

Paper Structure (14 sections, 5 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Five-dimensional grasp representation, consisting of location, orientation, and gripper plate size.
  • Figure 2: Cornell Grasp Data Set: sample images from the data set, shown with their multiple ground-truth labels. The blue channel is replaced with depth information.
  • Figure 3: Outline of the pipeline model, with RGB-D as input and the grasp representation as output. Each block is expanded in the same color.
  • Figure 4: The green box on the left contains correct predictions, and the red box on the right contains wrong predictions.
  • Figure 5: Training and validation loss values decreasing as the number of training epochs increases.
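The Figure 2 caption mentions replacing the blue channel with depth, while the TL;DR describes a 4-channel RGB-D input. Both ways of combining color and depth can be sketched with NumPy as below. The function names, the RGB channel order (blue at index 2), and the rescaling of depth to the range [0, 255] are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def replace_blue_with_depth(rgb, depth):
    """Return an H x W x 3 image whose blue channel is overwritten by depth."""
    out = rgb.astype(np.float32).copy()
    d = depth.astype(np.float32)
    span = d.max() - d.min()
    # Assumed choice: rescale depth to [0, 255] to match the color channels.
    d = (d - d.min()) / span * 255.0 if span > 0 else np.zeros_like(d)
    out[..., 2] = d  # index 2 is blue under the assumed RGB channel order
    return out

def stack_rgbd(rgb, depth):
    """Return an H x W x 4 array with depth appended as a fourth channel."""
    d = depth.astype(np.float32)[..., None]
    return np.concatenate([rgb.astype(np.float32), d], axis=-1)
```

The first variant keeps a 3-channel input, which matches what pre-trained feature extractors usually expect, whereas the second keeps all color information at the cost of a 4-channel first layer.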