Get a Grip: Multi-Finger Grasp Evaluation at Scale Enables Robust Sim-to-Real Transfer

Tyler Ga Wei Lum, Albert H. Li, Preston Culbertson, Krishnan Srinivasan, Aaron D. Ames, Mac Schwager, Jeannette Bohg

TL;DR

Existing datasets and methods have been insufficient for training discriminative models for multi-finger grasping. Ablations show that the key factor for performance is the grasp evaluator, and that its quality degrades as the training dataset shrinks, which demonstrates the importance of the new dataset.

Abstract

This work explores conditions under which multi-finger grasping algorithms can attain robust sim-to-real transfer. While numerous large datasets facilitate learning generative models for multi-finger grasping at scale, reliable real-world dexterous grasping remains challenging, with most methods degrading when deployed on hardware. An alternate strategy is to use discriminative grasp evaluation models for grasp selection and refinement, conditioned on real-world sensor measurements. This paradigm has produced state-of-the-art results for vision-based parallel-jaw grasping, but remains unproven in the multi-finger setting. In this work, we find that existing datasets and methods have been insufficient for training discriminative models for multi-finger grasping. To train grasp evaluators at scale, datasets must provide on the order of millions of grasps, including both positive and negative examples, with corresponding visual data resembling measurements at inference time. To that end, we release a new, open-source dataset of 3.5M grasps on 4.3K objects annotated with RGB images, point clouds, and trained NeRFs. Leveraging this dataset, we train vision-based grasp evaluators that outperform both analytic and generative modeling-based baselines on extensive simulated and real-world trials across a diverse range of objects. We show via numerous ablations that the key factor for performance is indeed the evaluator, and that its quality degrades as the dataset shrinks, demonstrating the importance of our new dataset. Project website at: https://sites.google.com/view/get-a-grip-dataset.

Paper Structure

This paper contains 34 sections, 2 equations, 21 figures, 4 tables.

Figures (21)

  • Figure 1: Data and label generation. Our dataset supplies the components for robust sim-to-real transfer. For each object in the dataset, we provide RGB images of it from several random views, a full point cloud, and trained NeRF weights, which allows our data to be compatible with most vision-based methods. We also supply hundreds of grasps per object, each of which is simulated in Isaac Gym multiple times with slight wrist pose perturbations. Averaging over these perturbed trials yields three smooth regression targets indicating the probability of (1) unwanted collisions, (2) simulated pick success, and (3) grasp success, the logical conjunction of (1) and (2).
  • Figure 2: Violin plots showing sim results over hundreds of unseen test objects. The short lines mark the median and IQR. "Diffusion" vs. "Fixed" refers to sampling the initial grasp $\mathcal{G}^{(0)}$ from a diffusion model or a fixed set of grasps. "NeRF" vs. "BPS" refers to whether the object $\mathcal{O}$ is given by NeRF or basis point set features. (A) Independent of the choice of sampler or object representation, evaluator refinement (right 4 columns) drastically improves planned grasp quality compared to sampling without refinement (left 2 columns). (B) Ablations on the dataset size used to train the evaluator, which demonstrate a strong correlation between dataset size and grasp planner performance. The percentages indicate the fraction of the dataset used to train the evaluator.
  • Figure 3: Hardware setup. Our robot consists of an Allegro hand and ZED 2i camera mounted onto a Franka Research 3. We select 20 objects categorized as easy, medium, or hard based on the complexity of their geometry, the presence of "distractor" geometry, and/or resemblance to objects in the training dataset (see App. \ref{app:object_selection} for details). We selected these objects to maximize size/shape diversity while providing a clear upper bound on expected performance via the hard objects. The bottom row shows representative successes and failures for the "Fixed/BPS" configuration.
  • Figure 4: Average grasping success rates on hardware across object difficulties. We find that the methods that leverage evaluators (hatched) outperform both analytic and generative-modeling-only baselines. Detailed per-object success rates are shown in Appendix \ref{sm:detailed_hardware_results}.
  • Figure 5: Success rates for all simulated evaluator-based methods. When only considering grasps with $y_\text{coll} \geq 0.8$, the median success rate increases from 37% to 80%. 2100/4240 grasps satisfied $y_\text{coll} \geq 0.8$. On the right, the unconditioned distribution is plotted with dashed lines for comparison.
  • ...and 16 more figures
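
The label averaging described in the Figure 1 caption and the $y_\text{coll} \geq 0.8$ filter from the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `smooth_labels` and `filter_by_collision_label` are hypothetical, as are the per-trial boolean encoding and the choice to take the conjunction per trial before averaging. The sketch also assumes that a high $y_\text{coll}$ means few unwanted collisions, which is how the Figure 5 threshold reads.

```python
from statistics import mean


def smooth_labels(trials):
    """Average binary outcomes of perturbed rollouts into soft labels.

    `trials` is a list of (collision_free, pick_success) booleans, one pair
    per simulated rollout with a slightly perturbed wrist pose. This encoding
    is an assumption made for illustration only.
    """
    y_coll = mean(1.0 if c else 0.0 for c, _ in trials)
    y_pick = mean(1.0 if p else 0.0 for _, p in trials)
    # Grasp success: conjunction of the two outcomes, taken per trial
    # (assumed here), then averaged over trials.
    y_grasp = mean(1.0 if (c and p) else 0.0 for c, p in trials)
    return y_coll, y_pick, y_grasp


def filter_by_collision_label(labels, threshold=0.8):
    """Keep only label triples whose y_coll is at least `threshold`."""
    return [lab for lab in labels if lab[0] >= threshold]


# Hypothetical outcomes for two grasps, four perturbed rollouts each.
grasp_a = [(True, True), (True, False), (False, False), (True, True)]
grasp_b = [(True, True), (True, True), (True, False), (True, True)]

labels = [smooth_labels(grasp_a), smooth_labels(grasp_b)]
kept = filter_by_collision_label(labels)
print(labels)
print(kept)
```

Averaging over perturbed rollouts turns a binary outcome into a value in [0, 1], so a grasp that only succeeds at its exact nominal wrist pose receives a lower target than one that tolerates small pose errors.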