RGBSQGrasp: Inferring Local Superquadric Primitives from Single RGB Image for Graspability-Aware Bin Picking

Yifeng Xu, Fan Zhu, Ye Li, Sebastian Ren, Xiaonan Huang, Yuhao Chen

TL;DR

RGBSQGrasp addresses bin-picking under severe occlusions by inferring local superquadric primitives from a single RGB image, removing the need for depth sensors or known CAD models. It combines a synthetic dataset, a dual-branch superquadric fitting network, RGB-based scene understanding with depth-from-RGB, and an SQ-guided grasp sampling module to generate stable grasps for unseen objects. The approach achieves a real-robot grasp success rate of 92% in packed bin-picking experiments and outperforms depth-reliant baselines, while ablation studies confirm the value of joint global-local features and refinement. This work enables robust, generalizable grasping in clutter and occlusion, with applicability to other end-effectors and grasp-enabled tasks.

Abstract

Bin picking is a challenging robotic task due to occlusions and physical constraints that limit visual information for object recognition and grasping. Existing approaches often rely on known CAD models or prior object geometries, restricting generalization to novel or unknown objects. Other methods directly regress grasp poses from RGB-D data without object priors, but the inherent noise in depth sensing and the lack of object understanding make grasp synthesis and evaluation more difficult. Superquadrics (SQ) offer a compact, interpretable shape representation that captures the physical and graspability-relevant properties of objects. However, recovering them from limited viewpoints is challenging, as existing methods rely on multiple perspectives for near-complete point cloud reconstruction, limiting their effectiveness in bin-picking. To address these challenges, we propose RGBSQGrasp, a grasping framework that leverages superquadric shape primitives and foundation metric depth estimation models to infer grasp poses from a monocular RGB camera, eliminating the need for depth sensors. Our framework integrates a universal, cross-platform dataset generation pipeline, a foundation model-based object point cloud estimation module, a global-local superquadric fitting network, and an SQ-guided grasp pose sampling module. Together, these components let RGBSQGrasp reliably infer grasp poses through geometric reasoning, enhancing grasp stability and adaptability to unseen objects. Real-world robotic experiments demonstrate a 92% grasp success rate, highlighting the effectiveness of RGBSQGrasp in packed bin-picking environments.
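
The abstract does not spell out how a superquadric is parameterized, so the following is a minimal sketch of the standard inside-outside function for a superquadric in its own canonical frame, assuming three scale parameters (a1, a2, a3) and two shape exponents (e1, e2). The function name and argument layout are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def superquadric_inside_outside(points, scale, exponents):
    """Evaluate the standard superquadric inside-outside function F.

    points:    (N, 3) array, expressed in the superquadric's canonical frame.
    scale:     (a1, a2, a3) semi-axis lengths along x, y, z.
    exponents: (e1, e2) shape exponents (e1 = e2 = 1 gives an ellipsoid).

    F < 1 means inside, F == 1 on the surface, F > 1 outside.
    """
    points = np.asarray(points, dtype=float)
    a1, a2, a3 = scale
    e1, e2 = exponents

    # Absolute values keep the fractional powers real for negative coordinates.
    x = np.abs(points[:, 0] / a1)
    y = np.abs(points[:, 1] / a2)
    z = np.abs(points[:, 2] / a3)

    xy_term = (x ** (2.0 / e2) + y ** (2.0 / e2)) ** (e2 / e1)
    return xy_term + z ** (2.0 / e1)
```

A fitting procedure can use this value as a residual: observed surface points from a partial point cloud should evaluate close to 1 once they are transformed into the primitive's frame.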

Paper Structure

This paper contains 15 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of the RGBSQGrasp framework. (1) Dataset generation using cross-platform simulators to create partial point clouds and superquadric ground truth pairs, (2) Superquadric fitting network with local and global feature extraction, (3) Object scene understanding using RGB image scan, depth estimation, and superquadric primitive fitting, and (4) Deployment in a real bin-picking experimental setup with a UR5e manipulator and RGB camera.
  • Figure 2: Shape primitive space for convex superquadrics.
  • Figure 3: We illustrate a sequential rollout of the superquadrics-guided robotic grasping process. The red point cloud represents the superquadric fitting for each partial point cloud, while the green vector denotes the grasp sampled from the fitted superquadrics. The top row visualizes the grasping sequence, and the bottom row depicts the evolving scene state after each step.
  • Figure 4: Grasp sampling workflow: Grasp candidates are generated based on superquadric fitting, with selection prioritized from high-quality regions and proximity to the object's center of mass (COM) to ensure stable execution.
  • Figure 5: Example scenes for the real-robot experiments.
  • ...and 1 more figure
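
The Figure 4 caption mentions generating grasp candidates from a fitted superquadric and prioritizing those close to the object's center of mass (COM). As a rough illustration only, the sketch below samples surface points with the standard angular superquadric parameterization and ranks them by distance to a COM estimate. The helper names, the grid resolution, and the choice of the superquadric center as the COM are assumptions; the "high-quality region" criterion is not specified in this excerpt and is left out.

```python
import numpy as np


def signed_pow(base, exponent):
    """Sign-preserving power, used by the angular superquadric parameterization."""
    return np.sign(base) * np.abs(base) ** exponent


def sample_superquadric_surface(scale, exponents, n_eta=16, n_omega=32):
    """Sample surface points of a superquadric in its canonical frame.

    scale:     (a1, a2, a3) semi-axis lengths.
    exponents: (e1, e2) shape exponents.
    Returns an (n_eta * n_omega, 3) array of surface points.
    """
    a1, a2, a3 = scale
    e1, e2 = exponents
    eta = np.linspace(-np.pi / 2, np.pi / 2, n_eta)
    omega = np.linspace(-np.pi, np.pi, n_omega, endpoint=False)
    eta, omega = np.meshgrid(eta, omega, indexing="ij")

    x = a1 * signed_pow(np.cos(eta), e1) * signed_pow(np.cos(omega), e2)
    y = a2 * signed_pow(np.cos(eta), e1) * signed_pow(np.sin(omega), e2)
    z = a3 * signed_pow(np.sin(eta), e1)
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)


def rank_by_com_proximity(candidates, com):
    """Sort candidate contact points by distance to the COM (closest first)."""
    candidates = np.asarray(candidates, dtype=float)
    dists = np.linalg.norm(candidates - np.asarray(com, dtype=float), axis=1)
    order = np.argsort(dists)
    return candidates[order], dists[order]


# Hypothetical primitive: a 6 cm x 10 cm x 20 cm ellipsoid-like object.
surface = sample_superquadric_surface((0.03, 0.05, 0.10), (1.0, 1.0))
ranked, dists = rank_by_com_proximity(surface, com=(0.0, 0.0, 0.0))
```

In a full pipeline each surface point would still need a gripper orientation and a collision check against the bin; this sketch covers only the COM-proximity ordering.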