MultiGraspNet: A Multitask 3D Vision Model for Multi-gripper Robotic Grasping

Stephany Ortuno-Chanelo, Paolo Rabino, Enrico Civitelli, Tatiana Tommasi, Raffaello Camoriano

TL;DR

MultiGraspNet addresses the limitations of single-gripper grasping by proposing a multitask 3D vision model that jointly predicts grasp poses for parallel and vacuum grippers from a single 3D point cloud. It employs a shared Minkowski-based backbone with gripper-specific refiners and learns per-point graspness maps by aligning GraspNet-1Billion and SuctionNet-1Billion data, enabling knowledge transfer across grasp modalities. The approach achieves competitive or superior performance versus single-task baselines on seen and novel objects and demonstrates real-world viability on a single-arm, dual-gripper robot, including improved vacuum grasping and robust parallel grasping. The work contributes a unified multitask architecture, an aligned dual-gripper dataset, and extensive experiments, highlighting the practicality of multitask learning for versatile robotic manipulation.

Abstract

Vision-based models for robotic grasping automate critical, repetitive, and draining industrial tasks. Existing approaches are typically limited in two ways: they either target a single gripper and are potentially applied on costly dual-arm setups, or rely on custom hybrid grippers that require ad-hoc learning procedures with logic that cannot be transferred across tasks, restricting their general applicability. In this work, we present MultiGraspNet, a novel multitask 3D deep learning method that predicts feasible poses simultaneously for parallel and vacuum grippers within a unified framework, enabling a single robot to handle multiple end effectors. The model is trained on the richly annotated GraspNet-1Billion and SuctionNet-1Billion datasets, which have been aligned for the purpose, and generates graspability masks quantifying the suitability of each scene point for successful grasps. By sharing early-stage features while maintaining gripper-specific refiners, MultiGraspNet effectively leverages complementary information across grasping modalities, enhancing robustness and adaptability in cluttered scenes. We characterize MultiGraspNet's performance with an extensive experimental analysis, demonstrating its competitiveness with single-task models on relevant benchmarks. We run real-world experiments on a single-arm multi-gripper robotic setup showing that our approach outperforms the vacuum baseline, grasping 16% more seen objects and 32% more novel ones, while obtaining competitive results for the parallel task.
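
The idea of sharing early-stage features while keeping gripper-specific branches can be illustrated with a toy example. The sketch below is not the authors' implementation: the Minkowski-based backbone is replaced by a tiny per-point layer with random weights, the refiners are omitted, and every name in it (`ToyMultiGripperNet`, `make_layer`, `feat_dim`) is hypothetical. It only shows the data flow: one shared feature vector per point, then one score per gripper.

```python
import math
import random

random.seed(0)  # reproducible toy weights and points


def make_layer(in_dim, out_dim):
    """Random weight matrix: out_dim rows of in_dim weights (no bias, for brevity)."""
    return [[random.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_layer(weights, x):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class ToyMultiGripperNet:
    """Hypothetical stand-in: shared per-point encoder plus one head per gripper."""

    def __init__(self, feat_dim=16):
        self.shared = make_layer(3, feat_dim)  # shared early-stage features
        self.heads = {
            "parallel": make_layer(feat_dim, 1),  # parallel-jaw graspness head
            "vacuum": make_layer(feat_dim, 1),    # vacuum graspness head
        }

    def forward(self, points):
        """Map (x, y, z) points to per-point scores in (0, 1), one list per gripper."""
        maps = {name: [] for name in self.heads}
        for p in points:
            # The shared features are computed once and reused by both heads.
            feats = [max(f, 0.0) for f in apply_layer(self.shared, p)]  # ReLU
            for name, head in self.heads.items():
                maps[name].append(sigmoid(apply_layer(head, feats)[0]))
        return maps


# Toy "point cloud": 256 random points in a cube (an assumption, not real data).
cloud = [tuple(random.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(256)]
graspness = ToyMultiGripperNet().forward(cloud)
print({name: len(scores) for name, scores in graspness.items()})
# prints {'parallel': 256, 'vacuum': 256}
```

Because both heads read the same features, a training signal from either gripper would update the shared layer, which is the mechanism the abstract credits for transferring information across grasping modalities.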

Paper Structure (24 sections, 2 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Overview of MultiGraspNet. Our proposed multi-gripper approach is formulated as a multitask deep network that processes a 3D scene point cloud and learns a shared representation to jointly predict grasping quality scores for parallel and vacuum grippers based on geometric cues. The resulting graspness maps identify graspable regions in cluttered scenes and are further refined to produce the final grasp poses for both the parallel and suction grippers.
  • Figure 2: Schematic illustration of the parallel-jaw gripper (a) and vacuum gripper (b) poses. The former includes the central point $c_p$, the depth $d$, the width $\omega$, as well as the approach direction $v$ and angle $a$. The vacuum grasp pose includes the central point $c_v$ and the normal to the point $n$.
  • Figure 3: Overview of our architecture. The network takes as input a 3D point cloud. Then the Minkowski-based backbone extracts geometric features. The features are then processed by a multi-branch grasp prediction head to predict the objectness and the graspness masks for each gripper. Finally, gripper-specific refinement modules are applied to generate the multi-gripper grasping poses.
  • Figure 4: Qualitative results of MultiGraspNet. For each scene, our model predicts a graspness score map and poses for the vacuum and parallel grippers. We show the top 100 grasps, ranked by the predicted grasp scores for each gripper.
  • Figure 5: Small Data Regime: panels (a) and (c) show the performance difference of MultiGraspNet with MultiGraspNet-par and MultiGraspNet-vac over different object groups. The (b) and (d) panels present the average $\overline{AP}$ discrepancy. In each box plot, the red cross indicates the mean, while the horizontal line is the median, calculated over ten repetitions.
  • ...and 1 more figure
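
The pose parameters named in the Figure 2 caption and the top-100 ranking mentioned in the Figure 4 caption can be written down as simple data structures. This is a minimal sketch under stated assumptions: the class names, field names, and the `top_k` helper are hypothetical, and the captions do not specify units or angle conventions, so none are assumed here.

```python
import heapq
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class ParallelGrasp:
    """Parallel-jaw pose with the symbols from the Figure 2 caption."""
    center: Vec3    # c_p, central point
    depth: float    # d
    width: float    # omega
    approach: Vec3  # v, approach direction
    angle: float    # a
    score: float    # predicted grasp score (field name is an assumption)


@dataclass(frozen=True)
class VacuumGrasp:
    """Vacuum pose with the symbols from the Figure 2 caption."""
    center: Vec3  # c_v, central point
    normal: Vec3  # n, normal at the point
    score: float  # predicted grasp score (field name is an assumption)


def top_k(grasps, k=100):
    """Return the k highest-scoring grasps, best first (k=100 as in Figure 4)."""
    return heapq.nlargest(k, grasps, key=lambda g: g.score)


# Three made-up vacuum candidates, only to exercise the ranking.
candidates = [
    VacuumGrasp((0.0, 0.0, 0.1 * i), (0.0, 0.0, 1.0), score=s)
    for i, s in enumerate([0.2, 0.9, 0.5])
]
print([g.score for g in top_k(candidates, k=2)])
# prints [0.9, 0.5]
```

The same `top_k` helper works for both grasp types because it only reads the `score` field, so each gripper's candidates can be ranked independently.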