SuperGrasp: Single-View Object Grasping via Superquadric Similarity Matching, Evaluation, and Refinement

Lijingze Xiao, Jinhong Du, Yang Cong, Supeng Diao, Yu Ren

Abstract

Robotic grasping from single-view observations remains a critical challenge in manipulation. Existing methods still struggle to generate stable and valid grasp poses when confronted with incomplete geometric information. To address these limitations, we propose SuperGrasp, a novel two-stage framework for single-view grasping with parallel-jaw grippers that decomposes the grasping process into initial grasp pose generation and subsequent grasp evaluation and refinement. In the first stage, we introduce a Similarity Matching Module that efficiently retrieves grasp candidates by matching the input single-view point cloud with a pre-computed primitive dataset based on superquadric coefficients. In the second stage, we propose E-RNet, an end-to-end network that expands the grasp-aware region and takes the initial grasp closure region as a local anchor region, enabling more accurate and reliable evaluation and refinement of grasp candidates. To enhance generalization, we construct a primitive dataset containing 1.5k primitives for similarity matching and collect a large-scale point cloud dataset with 100k stable grasp labels from 124 objects for network training. Extensive experiments in both simulation and real-world environments demonstrate that our method achieves stable grasping performance and strong generalization across varying scenes and novel objects.

Paper Structure

This paper contains 20 sections, 14 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overview of the task pipeline. Under dense clutter and single-view observation constraints, the tabletop decluttering task is accomplished through a two-stage pipeline consisting of grasp candidate generation, grasp evaluation, and refinement.
  • Figure 2: Convex superquadric shapes under different parameter settings.
  • Figure 3: The primitive object database constructed in this work, which contains 1.5k objects of various shapes and scales, including cylinders, frustums, elliptical cylinders, and cuboids.
  • Figure 4: Overview: Our framework consists of two stages. In the first stage, we first estimate the superquadric coefficients of the target object and compute their similarity to those of the objects in the database. We then select the top-$N$ most similar objects and transfer their grasp poses to the target object through a transformation matrix $T$, followed by a coarse filtering step to remove noisy candidates. In the second stage, we enlarge the perception region of each initial grasp and feed the cropped regional point cloud into a PointNet++ network for feature extraction. We then further crop the features corresponding to the points within the initial gripper closing region and aggregate them via max pooling to obtain the grasp anchor feature. Finally, the anchor feature is fed into an evaluation network and a refinement network to predict grasp feasibility and refinement feasibility, respectively.
  • Figure 5: Gripper configuration and refinement strategy. In (a), the $x$-, $y$-, and $z$-axes define the gripper coordinate system, $w$ represents the gripper width, $L$ denotes the fingertip length, and $F_d$ is the applied disturbance force directed vertically downward. In (b), the initial gripper is first extended downward by $0.008\,\mathrm{m}$ along the $z$-axis, and is then rotated by $\pm 15^\circ$ about the $z$-axis.
  • ...and 2 more figures
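
The first stage matches the superquadric coefficients fitted to the observed point cloud against those of the primitive database and transfers grasps from the top-$N$ most similar primitives. The sketch below illustrates this idea using the standard superquadric inside-outside function $F(x,y,z) = \left(\left|x/a_1\right|^{2/\varepsilon_2} + \left|y/a_2\right|^{2/\varepsilon_2}\right)^{\varepsilon_2/\varepsilon_1} + \left|z/a_3\right|^{2/\varepsilon_1}$ and a simple Euclidean distance over coefficient vectors as the similarity measure; the paper's exact fitting procedure and similarity metric are not specified here, so the metric and the `top_n_matches` helper are illustrative assumptions, not the authors' implementation.

```python
import math

def superquadric_inside_outside(p, a, e):
    """Standard superquadric inside-outside function F.
    F < 1: point inside the surface; F = 1: on the surface; F > 1: outside.
    a = (a1, a2, a3) are the axis scales, e = (e1, e2) the shape exponents."""
    x, y, z = p
    a1, a2, a3 = a
    e1, e2 = e
    xy_term = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    return xy_term + abs(z / a3) ** (2.0 / e1)

def coefficient_distance(c_query, c_db):
    """Euclidean distance between coefficient vectors (a1, a2, a3, e1, e2).
    Illustrative similarity measure; the paper's metric may differ."""
    return math.sqrt(sum((q - d) ** 2 for q, d in zip(c_query, c_db)))

def top_n_matches(c_query, database, n=3):
    """Return the n database coefficient vectors closest to the query,
    mimicking the retrieval step of the Similarity Matching Module."""
    return sorted(database, key=lambda c: coefficient_distance(c_query, c))[:n]

# A unit sphere corresponds to a = (1, 1, 1), e = (1, 1):
# the point (1, 0, 0) lies exactly on its surface (F = 1).
f_surface = superquadric_inside_outside((1.0, 0.0, 0.0), (1, 1, 1), (1, 1))

# Retrieval example over a tiny hypothetical coefficient database.
db = [(1, 1, 1, 1, 1), (2, 2, 2, 1, 1), (1, 1, 1, 0.2, 0.2)]
best = top_n_matches((1, 1, 1, 1, 1), db, n=1)[0]
```

In the full pipeline, each retrieved primitive's grasp poses would then be mapped onto the target via the fitted transformation $T$ and coarsely filtered, as described in the Figure 4 caption.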