Improving the Successful Robotic Grasp Detection Using Convolutional Neural Networks
Hamed Hosseini, Mehdi Tale Masouleh, Ahmad Kalhor
TL;DR
This work tackles real-time robotic grasp detection by regressing RGB-D input to a rectangle grasp representation $g=\{x,y,\theta,w,h\}$, or its orientation-encoded form with $\sin\theta$ and $\cos\theta$. A two-stage CNN pipeline using transfer-learned feature extractors outputs a 6D vector $\hat{t}=(\hat{x},\hat{y},\widehat{\sin\theta},\widehat{\cos\theta},\hat{w},\hat{h})$, with 4-channel RGB-D input and normalization strategies to stabilize training. Key contributions include data augmentation (rotation and zoom), output normalization, and the finding that depth information significantly boosts Jaccard-based grasp accuracy; among tested backbones, AlexNet provided the best real-time performance. Evaluations on the Cornell dataset show competitive grasp detection accuracy, and results point to practical deployment opportunities with ROS/Gazebo and potential integration of force sensing. The approach advances robust, fast grasp perception for unseen objects in robotic manipulation tasks.
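The mapping between the rectangle $g=\{x,y,\theta,w,h\}$ and the 6D regression target can be illustrated with a minimal Python sketch. The names `Grasp`, `encode`, and `decode` are hypothetical and chosen only for illustration; the use of radians and the `atan2`-based angle recovery are assumptions, not details confirmed by the paper.

```python
import math
from typing import NamedTuple, Tuple

Target = Tuple[float, float, float, float, float, float]


class Grasp(NamedTuple):
    """Rectangle grasp g = {x, y, theta, w, h} (theta in radians, an assumption)."""
    x: float
    y: float
    theta: float
    w: float
    h: float


def encode(g: Grasp) -> Target:
    """Map a grasp rectangle to the 6D target (x, y, sin(theta), cos(theta), w, h)."""
    return (g.x, g.y, math.sin(g.theta), math.cos(g.theta), g.w, g.h)


def decode(t: Target) -> Grasp:
    """Recover a grasp rectangle from a 6D vector.

    atan2 depends only on the direction of (sin, cos), so the angle is
    recovered even when a predicted pair does not have unit norm.
    """
    x, y, s, c, w, h = t
    return Grasp(x, y, math.atan2(s, c), w, h)


if __name__ == "__main__":
    g = Grasp(x=10.0, y=20.0, theta=0.5, w=30.0, h=15.0)
    print(decode(encode(g)))
```

Encoding the angle as a sine and cosine pair gives the regressor a continuous target, whereas a raw angle jumps at the wrap-around point.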
Abstract
Robotic grasping should be carried out in real time with adequate accuracy. Perception is the first, and a significant, step in this procedure. This paper proposes an improved pipeline model that detects grasps, expressed in a rectangle representation, for both seen and unseen objects. This helps the robot start its control procedure closer to the proper part of the object. The main idea consists of pre-processing, output normalization, and data augmentation, which together improve accuracy by 4.3 percent without slowing the system down. In addition, a comparison is conducted over different pre-trained models, such as AlexNet, ResNet, and VGG19, which are among the best-known feature extractors for image processing in object detection. Although AlexNet is less complex than the others, it outperformed them, which also benefits real-time operation.
