Combining Shape Completion and Grasp Prediction for Fast and Versatile Grasping with a Multi-Fingered Hand

Matthias Humt, Dominik Winkelbauer, Ulrich Hillenbrand, Berthold Bäuml

TL;DR

This work tackles unknown-object grasping from limited sensing by coupling a fast implicit shape completion module with a data-driven, multi-finger grasp predictor. It introduces a synthetic-data generation and training strategy for realism and sim-to-real transfer, and a two-stage grasp network that handles pose uncertainty and prediction ambiguities with a multi-head design. The full pipeline takes about 1 s on a real robot (0.7 s for shape completion and 0.3 s for generating 1000 grasps) and demonstrates robust grasping across diverse household items, validating both the shape reconstruction and grasping components. Overall, the paper advances versatile grasping under partial observability by integrating high-fidelity shape completion with robust, fast grasp prediction for dexterous hands.

Abstract

Grasping objects with limited or no prior knowledge about them is a highly relevant skill in assistive robotics. Still, in this general setting, it has remained an open problem, especially under partial observability and for versatile grasping with multi-fingered hands. We present a novel, fast, and high-fidelity deep learning pipeline consisting of a shape completion module that takes a single depth image as input, followed by a grasp predictor that operates on the predicted object shape. The shape completion network is based on VQDIF and predicts spatial occupancy values at arbitrary query points. As grasp predictor, we use our two-stage architecture that first generates hand poses using an autoregressive model and then regresses finger joint configurations per pose. Critical factors turn out to be sufficient data realism and augmentation, as well as special attention to difficult cases during training. Experiments on a physical robot platform demonstrate successful grasping of a wide range of household objects based on a depth image from a single viewpoint. The whole pipeline is fast, taking only about 1 s for completing the object's shape (0.7 s) and generating 1000 grasps (0.3 s).
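
The abstract notes that the shape completion network predicts occupancy values at arbitrary query points. As a rough illustration of what that interface allows, the sketch below queries an occupancy function on a regular 3D grid and thresholds the result. It does not use the paper's network: `toy_occupancy` is a hypothetical stand-in (a soft sphere), and the grid resolution, extent, and 0.5 threshold are assumptions chosen only for the example.

```python
import numpy as np


def toy_occupancy(points: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a trained occupancy network.

    Returns an occupancy probability per 3D point for a sphere of
    radius 0.3 centered at the origin (sigmoid of the signed distance).
    """
    dist = np.linalg.norm(points, axis=-1)
    return 1.0 / (1.0 + np.exp((dist - 0.3) / 0.02))


def query_grid(occupancy_fn, resolution: int = 32, extent: float = 0.5):
    """Evaluate an occupancy function on a regular grid in [-extent, extent]^3."""
    axis = np.linspace(-extent, extent, resolution)
    # grid has shape (resolution, resolution, resolution, 3)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    probs = occupancy_fn(grid.reshape(-1, 3))
    return grid, probs.reshape(resolution, resolution, resolution)


grid, probs = query_grid(toy_occupancy)
occupied = probs > 0.5  # assumed threshold for "inside the object"
print("occupied fraction of grid:", occupied.mean())
```

Because the function can be evaluated at any point, the same call pattern would work for a denser grid or for a sparse set of points near a candidate grasp; a dense grid like `probs` is also the kind of input from which a surface mesh could be extracted.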

Paper Structure

This paper contains 14 sections, 2 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: Grasping the YCB bleach bottle using our grasping pipeline: The object is first perceived using Agile Justin's [Bauml2014] Kinect camera to obtain a single depth image. Afterward, the shape completion network predicts the full object shape, based on which the grasping network generates a stable grasp. The grasp is then executed on the real robot using whole-body motion planning [Tenhumberg2022, Tenhumberg2023] for positioning the hand relative to the object and a kinematically calibrated robotic system [Tenhumberg2022a, Tenhumberg2023a].
  • Figure 2: Our pipeline consists of two steps during inference. From a partial point cloud of an object obtained through rendering (training) or a depth sensor (inference), we use a shape completion network [Yan2022ShapeFormerTS] to predict the full object geometry implicitly as occupancy probabilities at query points on a grid from which we can optionally extract the completed mesh using Marching Cubes [Lorensen1987MarchingCA]. Once this network is trained, the predicted occupancy probabilities are used as input for the grasp predictor [Winkelbauer2022GraspPredictor] both during training on ground truth (GT) grasps as well as during deployment on our robot Agile Justin.
  • Figure 3: Comparison of standard depth rendering (left), Kinect depth simulation (middle), and the real Kinect depth image (right). The object mesh is obtained via a laser scanner, and its pose through a registration step.
  • Figure 4: The DLR-Hand II [Butterfass2001], which we use in our experiments. Each red dot represents a potential contact point. In the left image, all contact points used in the grasp planner are shown, while the right image shows only the central contact points that are used to verify that a finger is still in full contact with the object after slight pose variations.
  • Figure 5: Adapted joint predictor network architecture with multiple heads. Each head predicts one configuration together with a logit $l_k$, which is used to determine which head to use at inference.
  • ...and 5 more figures
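
The Figure 5 caption describes a joint predictor with multiple heads, each producing one joint configuration and a logit $l_k$ that determines which head is used at inference. A minimal sketch of one plausible selection rule follows. Picking the head with the largest logit is an assumption made here for illustration, and the function name `select_head`, the number of heads, and the joint dimension are hypothetical; none of them are taken from the paper.

```python
import numpy as np


def select_head(joint_configs: np.ndarray, logits: np.ndarray):
    """Pick one joint configuration per sample from K candidate heads.

    joint_configs: shape (batch, heads, joints), one configuration per head
    logits:        shape (batch, heads), one logit l_k per head
    Assumption: the head with the largest logit is used.
    """
    best = np.argmax(logits, axis=1)
    # Advanced indexing: for each sample i, take joint_configs[i, best[i]].
    chosen = joint_configs[np.arange(joint_configs.shape[0]), best]
    return chosen, best


# Illustrative sizes only: 2 samples, 3 heads, 4 joint values per head.
configs = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
logits = np.array([[0.1, 2.0, -1.0],
                   [3.0, 0.0, 1.0]])
chosen, best = select_head(configs, logits)
print(best)    # index of the selected head per sample
print(chosen)  # the selected joint configuration per sample
```

Letting each head commit to a single configuration, and then choosing among heads, avoids averaging over several distinct valid finger configurations, which is one common motivation for multi-head designs when a regression target is ambiguous.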