RGBSQGrasp: Inferring Local Superquadric Primitives from Single RGB Image for Graspability-Aware Bin Picking

Yifeng Xu; Fan Zhu; Ye Li; Sebastian Ren; Xiaonan Huang; Yuhao Chen

TL;DR

RGBSQGrasp addresses bin picking under severe occlusion by inferring local superquadric primitives from a single RGB image, removing the need for depth sensors or known CAD models. It combines a synthetic dataset generation pipeline, metric depth estimation from RGB for object point cloud recovery, a dual-branch (global-local) superquadric fitting network, and an SQ-guided grasp sampling module to generate stable grasps for unseen objects. The approach achieves a 92% grasp success rate in real-robot packed bin-picking experiments and outperforms depth-reliant baselines, while ablation studies confirm the value of joint global-local features and refinement. The result is robust, generalizable grasping under clutter and occlusion, with potential applicability to other end-effectors and to downstream tasks that depend on grasping.

Abstract

Bin picking is a challenging robotic task because occlusions and physical constraints limit the visual information available for object recognition and grasping. Existing approaches often rely on known CAD models or prior object geometries, which restricts generalization to novel or unknown objects. Other methods directly regress grasp poses from RGB-D data without object priors, but the inherent noise in depth sensing and the lack of object understanding make grasp synthesis and evaluation more difficult. Superquadrics (SQ) offer a compact, interpretable shape representation that supports reasoning about an object's physical properties and graspability. However, recovering them from limited viewpoints is challenging: existing methods rely on multiple perspectives to reconstruct a near-complete point cloud, which limits their effectiveness in bin picking. To address these challenges, we propose RGBSQGrasp, a grasping framework that leverages superquadric shape primitives and foundation metric depth estimation models to infer grasp poses from a monocular RGB camera, eliminating the need for depth sensors. Our framework integrates a universal, cross-platform dataset generation pipeline, a foundation model-based object point cloud estimation module, a global-local superquadric fitting network, and an SQ-guided grasp pose sampling module. By integrating these components, RGBSQGrasp reliably infers grasp poses through geometric reasoning, enhancing grasp stability and adaptability to unseen objects. Real-world robotic experiments demonstrate a 92% grasp success rate, highlighting the effectiveness of RGBSQGrasp in packed bin-picking environments.
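
For readers unfamiliar with superquadrics, the sketch below evaluates the standard superquadric inside-outside function in the primitive's own canonical frame. It is an illustration of the shape representation only, not the paper's fitting network or parameterization: the function name, argument names, and the omission of the pose (rotation and translation) are assumptions made here for brevity.

```python
def superquadric_inside_outside(point, scale, exponents):
    """Evaluate the standard superquadric implicit function F at a point.

    F < 1: point is inside; F == 1: on the surface; F > 1: outside.
    All names here are illustrative and are not taken from the paper.

    point     -- (x, y, z) in the superquadric's canonical frame
    scale     -- (a1, a2, a3), semi-axis lengths, all > 0
    exponents -- (e1, e2), shape exponents, both > 0
                 (1, 1) gives an ellipsoid; values near 0 approach a box
    """
    x, y, z = point
    a1, a2, a3 = scale
    e1, e2 = exponents
    # Combine the x and y terms first, then raise to e2 / e1.
    xy_term = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    z_term = abs(z / a3) ** (2.0 / e1)
    return xy_term + z_term


if __name__ == "__main__":
    corner = (0.9, 0.9, 0.9)
    # Same point, same unit scale, different exponents.
    sphere_value = superquadric_inside_outside(corner, (1, 1, 1), (1.0, 1.0))
    boxy_value = superquadric_inside_outside(corner, (1, 1, 1), (0.1, 0.1))
    print("sphere-like:", "inside" if sphere_value < 1 else "outside")
    print("box-like:", "inside" if boxy_value < 1 else "outside")
```

The same point near the corner of the unit cube falls outside the sphere-like primitive but inside the box-like one, which shows how five numbers (three scales and two exponents) can describe a range of shapes. A full superquadric also carries a 6-DoF pose that maps world points into this canonical frame.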
