Learning a visuomotor controller for real world robotic grasping using simulated depth images

Ulrich Viereck, Andreas ten Pas, Kate Saenko, Robert Platt

TL;DR

This work introduces a closed-loop visuomotor controller for robotic grasping that uses depth images from a wrist-mounted sensor and a CNN that predicts distance-to-nearest-grasp. Training data are generated entirely in simulation via OpenRAVE, mapping depth-action pairs to an L1-optimized distance function, which the controller uses to iteratively approach a grasp. The approach transfers well to real sensors and outperforms a strong one-shot baseline under kinematic noise and perceptual disturbances, with notable gains in dynamic scenes. The results demonstrate the practicality of sim-to-real, depth-based, feedback-controlled grasping in cluttered and shifting environments, while suggesting avenues for faster corrections and deployment on noisier hardware.

Abstract

We want to build robots that are useful in unstructured real world applications, such as doing work in the household. Grasping in particular is an important skill in this domain, yet it remains a challenge. One of the key hurdles is handling unexpected changes or motion in the objects being grasped and kinematic noise or other errors in the robot. This paper proposes an approach to learning a closed-loop controller for robotic grasping that dynamically guides the gripper to the object. We use a wrist-mounted sensor to acquire depth images in front of the gripper and train a convolutional neural network to learn a distance function to true grasps for grasp configurations over an image. The training sensor data is generated in simulation, a major advantage over previous work that uses real robot experience, which is costly to obtain. Despite being trained in simulation, our approach works well on real noisy sensor images. We compare our controller in simulated and real robot experiments to a strong baseline for grasp pose detection, and find that our approach significantly outperforms the baseline in the presence of kinematic noise, perceptual errors and disturbances of the object during grasping.
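As a rough illustration of the distance function described above, the sketch below labels a candidate gripper pose with its distance to the nearest ground-truth grasp and scores predictions with an L1 loss. The planar pose parameterization (x, y, theta), the angular weight `angle_weight`, and all function names are assumptions made for this example; they are not taken from the authors' implementation.

```python
import math

def angle_diff(a, b):
    """Smallest absolute difference between two gripper angles (radians).

    Assumption: a parallel-jaw gripper is symmetric under a 180 degree
    rotation, so angles are compared modulo pi.
    """
    d = abs(a - b) % math.pi
    return min(d, math.pi - d)

def pose_distance(pose, grasp, angle_weight=0.05):
    """Distance between two planar poses (x, y, theta).

    Positions are in meters. angle_weight (meters per radian) is a
    hypothetical trade-off between translation and rotation error.
    """
    (x, y, th), (gx, gy, gth) = pose, grasp
    return math.hypot(x - gx, y - gy) + angle_weight * angle_diff(th, gth)

def distance_to_nearest_grasp(pose, true_grasps):
    """Training label: distance from an offset pose to the closest true grasp."""
    return min(pose_distance(pose, g) for g in true_grasps)

def l1_loss(predicted, target):
    """Mean absolute error between predicted and labeled distances."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

# Two hypothetical ground-truth grasps and one offset pose.
true_grasps = [(0.0, 0.0, 0.0), (0.10, 0.0, math.pi / 2)]
offset_pose = (0.03, 0.04, 0.0)
label = distance_to_nearest_grasp(offset_pose, true_grasps)
```

In this toy setup the offset pose is labeled with its distance to the first grasp, since the second one is farther away in both position and orientation. A network trained on such labels would regress the same quantity from a depth image and a candidate action.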

Paper Structure

This paper contains 14 sections, 1 equation, 8 figures, 1 table.

Figures (8)

  • Figure 1: Our controller makes dynamic corrections while grasping using depth image feedback from a sensor mounted to the robot's wrist. (a) The hand has moved to the initial detected grasping position for the flashlight. (b) The flashlight has shifted and the hand has become misaligned with the object. (c) The controller has corrected for the misalignment and has moved the hand into a good grasp pose. The controller is now ready to pick up the flashlight. (d)-(f) show the corresponding depth images. The green lines show initial grasps predicted by the CNN. The red line shows the current gripper pose.
  • Figure 2: Overview of our approach. The training data is generated in an OpenRAVE simulator (see the dataset section). A CNN model is trained to predict distance to nearest grasps (see the model section). A controller moves the gripper to predicted good grasp poses (see the controller section).
  • Figure 3: Calculating the distance-to-nearest-grasp for two different offset poses (shown in red and blue). During creation of the training set, we estimate the distance between each of these pose offsets and the nearest ground truth grasp (shown in green).
  • Figure 4: Illustration of how the controller works in a 1-dimensional case with two objects (control in the x-axis direction). Although the global best prediction for the grasp pose belongs to the object on the left, the controller moves to the closer object on the right, because it follows the direction of the local gradient near the center of the image.
  • Figure 5: Histogram of distances of predicted grasps to the closest true grasp for 400 simulated trials for various scenarios (bin size = 3 cm). The left plot shows that our approach (CTR) compensates well for movement noise of the gripper, whereas the baseline method (GPD) fails to compensate. The right plot shows that our closed-loop controller compensates for perceptual errors made in the first images by making corrections based on new images while moving to the grasp.
  • ...and 3 more figures
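
The 1-dimensional behavior described in the Figure 4 caption can be sketched as a small simulation. This is a minimal illustration under stated assumptions: `predicted_distance` is a hand-written stand-in for the CNN output, the object positions, the step size, and the finite-difference gradient are all hypothetical, and the authors' controller may select actions differently.

```python
LEFT_OBJECT = -0.30   # hypothetical object position (m), globally best prediction
RIGHT_OBJECT = 0.10   # hypothetical object position (m), closer to the gripper
RIGHT_BIAS = 0.02     # makes the right minimum slightly worse than the left one

def predicted_distance(x):
    """Stand-in for the CNN: predicted distance-to-nearest-grasp at position x."""
    return min(abs(x - LEFT_OBJECT), abs(x - RIGHT_OBJECT) + RIGHT_BIAS)

def local_gradient(x, eps=0.005):
    """Central finite difference of the prediction around the current position."""
    return (predicted_distance(x + eps) - predicted_distance(x - eps)) / (2 * eps)

def run_controller(x=0.0, step=0.02, max_steps=50, tol=0.5):
    """Closed loop: re-evaluate the prediction at every step and move downhill."""
    trajectory = [x]
    for _ in range(max_steps):
        g = local_gradient(x)
        if abs(g) < tol:          # flat slope: we are at a local minimum
            break
        x += -step if g > 0 else step
        trajectory.append(x)
    return x, trajectory

final_x, trajectory = run_controller()

# Global minimum of the prediction over a coarse grid, for comparison.
grid = [(i - 40) / 100 for i in range(81)]
global_best = min(grid, key=predicted_distance)
```

Starting at the image center (x = 0), the loop settles at the right-hand object even though the grid search finds the lowest predicted distance at the left-hand object, which mirrors the behavior the caption describes. Because the prediction is re-evaluated at every step, the same loop would follow an object that moves between iterations.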