SuperQ-GRASP: Superquadrics-based Grasp Pose Estimation on Larger Objects for Mobile-Manipulation

Xun Tu, Karthik Desingh

TL;DR

SuperQ-GRASP introduces a geometric grasping pipeline that models large objects by reconstructing a mesh from multi-view RGB images with a NeRF-based approach and decomposing it into superquadrics (SQs). Grasp candidates are generated on each SQ via principled sampling and then validated for collision avoidance and near-antipodal stability, yielding grasps that are both close to the gripper and reliable for mobile manipulation. Across synthetic and real objects, the method produces grasps that are closer and more often valid than those of the baselines, and real-world experiments on Spot show robustness to viewpoint variation, though pose estimation accuracy remains a key bottleneck. The work advances grasp planning for non-tabletop, high-genus objects by combining implicit modeling, a primitive-based representation, and efficient grasp sampling into one pipeline demonstrated on mobile manipulation tasks.
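To make the "near-antipodal" validation step concrete, here is a minimal sketch of one common way to test a two-finger grasp. It assumes two contact points with outward unit surface normals and an angular tolerance. The function name `is_near_antipodal` and the 15-degree default are illustrative assumptions, not the criterion or threshold used in the paper.

```python
import math


def is_near_antipodal(p1, n1, p2, n2, max_angle_deg=15.0):
    """Check whether two contacts form a near-antipodal pair.

    p1, p2: contact points as (x, y, z) tuples.
    n1, n2: outward unit surface normals at those points.
    max_angle_deg: assumed tolerance between each normal and the grasp axis.
    """
    # Grasp axis: unit vector pointing from contact 1 to contact 2.
    axis = [b - a for a, b in zip(p1, p2)]
    length = math.sqrt(sum(c * c for c in axis))
    if length == 0.0:
        return False  # coincident contacts cannot define a grasp axis
    axis = [c / length for c in axis]

    # The outward normal at p1 should point away from p2 (along -axis),
    # and the outward normal at p2 should point away from p1 (along +axis).
    cos1 = -sum(a * n for a, n in zip(axis, n1))
    cos2 = sum(a * n for a, n in zip(axis, n2))

    cos_limit = math.cos(math.radians(max_angle_deg))
    return cos1 >= cos_limit and cos2 >= cos_limit
```

For two contacts on opposite faces of a box the normals line up with the grasp axis and the check passes; if one contact sits on an adjacent face, its normal is perpendicular to the axis and the pair is rejected.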

Abstract

Grasp planning and estimation have been a longstanding research problem in robotics, with two main approaches to finding graspable poses on objects: 1) a geometric approach, which relies on 3D models of the object and the gripper to estimate valid grasp poses, and 2) a data-driven, learning-based approach, in which models are trained to identify grasp poses from raw sensor observations. The latter assumes comprehensive geometric coverage during the training phase. However, data-driven approaches are typically biased toward tabletop scenarios and struggle to generalize to out-of-distribution scenarios with larger objects (e.g., a chair). Additionally, raw sensor data (e.g., RGB-D data) from a single view of these larger objects is often incomplete and necessitates additional observations. In this paper, we take a geometric approach, leveraging advancements in object modeling (e.g., NeRF) to build an implicit model from RGB images taken from views around the target object. This model enables the extraction of an explicit mesh model while also capturing the visual appearance from novel viewpoints, which is useful for perception tasks such as object detection and pose estimation. We further decompose the NeRF-reconstructed 3D mesh into superquadrics (SQs), parametric geometric primitives that are each mapped to a set of precomputed grasp poses, allowing grasp composition on the target object based on these primitives. Our proposed pipeline overcomes two problems: a) noisy depth and an incomplete view of the object, addressed by the modeling step, and b) generalization to objects of any size. For more qualitative results, refer to the supplementary video and webpage: https://bit.ly/3ZrOanU
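Since superquadrics are the core primitive here, a short sketch of the textbook superquadric model may help. It shows the standard inside-outside function and the matching surface parametrization for a superquadric in its own canonical frame, with shape exponents `eps1` and `eps2` (the $\varepsilon_1$, $\varepsilon_2$ of Figure 2). The scale names `a1, a2, a3` and all function names are assumptions for illustration. Conventions vary across the literature, and this excerpt does not state the exact form used in the paper.

```python
import math


def signed_pow(base, exponent):
    """Return sign(base) * |base| ** exponent (the signed power used by superquadrics)."""
    return math.copysign(abs(base) ** exponent, base)


def inside_outside(point, scale, eps1, eps2):
    """Standard superquadric inside-outside value F.

    F < 1: point is inside, F == 1: on the surface, F > 1: outside.
    point: (x, y, z) in the superquadric's own frame.
    scale: (a1, a2, a3) semi-axis lengths (assumed naming).
    """
    x, y, z = point
    a1, a2, a3 = scale
    xy_term = abs(x / a1) ** (2.0 / eps2) + abs(y / a2) ** (2.0 / eps2)
    return xy_term ** (eps2 / eps1) + abs(z / a3) ** (2.0 / eps1)


def surface_point(eta, omega, scale, eps1, eps2):
    """Surface point for eta in [-pi/2, pi/2] and omega in [-pi, pi)."""
    a1, a2, a3 = scale
    cos_eta = signed_pow(math.cos(eta), eps1)
    x = a1 * cos_eta * signed_pow(math.cos(omega), eps2)
    y = a2 * cos_eta * signed_pow(math.sin(omega), eps2)
    z = a3 * signed_pow(math.sin(eta), eps1)
    return (x, y, z)
```

With `eps1 = eps2 = 1` the shape is an ellipsoid, while exponents near 0 give box-like shapes. Any point returned by `surface_point` should evaluate to approximately 1 under `inside_outside`, which is a handy sanity check when sampling grasp candidates on a primitive.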

Paper Structure

This paper contains 18 sections, 6 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Comparison of grasp pose estimation methods using the whole mesh model vs. raw sensor observations from a single viewpoint. Left: grasp poses predicted by our method on the whole mesh. Right: grasp poses predicted by feeding only the partial depth point cloud to Contact-GraspNet (blue points: observed depth point cloud).
  • Figure 2: Grasp pose candidates on individual superquadrics of different shapes, specified by different pairs of parameters ($\varepsilon_1$, $\varepsilon_2$)
  • Figure 3: Illustration of boundary points and locations of grasp pose candidates in a quadrant on the individual superquadrics. For the continuous region, the grasp pose candidates are obtained along the sample points on the mesh. For the discontinuous region, the grasp pose candidates are sampled along the asymptotic lines.
  • Figure 4: Objects in the dataset. In total, there are 15 synthetic objects from PartNet-Mobility (SAPIEN; Xiang et al., 2020) and 5 objects from the real world.
  • Figure 5: Qualitative results of predicted grasp poses at various gripper positions for Chair 3 in the synthetic dataset. In all cases, our method SuperQ-GRASP consistently identifies the closest superquadric and estimates valid grasp poses accordingly. Red indicates invalid grasps, while green indicates valid grasps.
  • ...and 1 more figure