Sim-Grasp: Learning 6-DOF Grasp Policies for Cluttered Environments Using a Synthetic Benchmark

Juncheng Li, David J. Cappelleri

TL;DR

Sim-Grasp addresses robust 6-DOF grasping in clutter by learning from a large synthetic benchmark. It combines a 6-DOF grasp network (Sim-GraspNet) with multi-modal policies (object-agnostic, text-prompt, and box-prompt) to enable open-set grasping and target picking, using GroundingDINO and SAM for semantic guidance. The Sim-Grasp-Dataset provides 1,550 objects across 500 cluttered scenes with ~7.9M 6D grasp labels generated via physics-based simulation. The system achieves state-of-the-art performance on both isolated and cluttered tasks (97.14% success on single objects; 87.43% and 83.33% in mixed clutter of Level 1–2 and Level 3–4 objects, respectively) and demonstrates sim-to-real transfer on a Fetch robot. Limitations include handling transparent and deformable objects without tactile feedback, which motivates future work on closed-loop sensing and manipulation with tactile transducers and force sensing.

Abstract

In this paper, we present Sim-Grasp, a robust 6-DOF two-finger grasping system that integrates advanced language models for enhanced object manipulation in cluttered environments. We introduce the Sim-Grasp-Dataset, which includes 1,550 objects across 500 scenarios with 7.9 million annotated labels, and develop Sim-GraspNet to generate grasp poses from point clouds. The Sim-Grasp-Policies achieve grasping success rates of 97.14% for single objects and 87.43% and 83.33% for mixed clutter scenarios of Levels 1-2 and Levels 3-4 objects, respectively. By incorporating language models for target identification through text and box prompts, Sim-Grasp enables both object-agnostic and target picking, pushing the boundaries of intelligent robotic systems.
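
The difference between object-agnostic picking and target picking can be illustrated with a minimal sketch. It assumes that each grasp candidate carries a confidence score and an image pixel, and that a boolean target mask has already been produced from a text or box prompt. No GroundingDINO or SAM call is shown. The function name `select_grasp` and its inputs are hypothetical and are not taken from the paper.

```python
import numpy as np

def select_grasp(points_uv, scores, mask=None):
    """Pick the index of the best grasp candidate (hypothetical helper).

    points_uv: (N, 2) integer pixel coordinates (u = column, v = row),
               assumed to lie inside the image.
    scores:    (N,) confidence score per candidate.
    mask:      optional boolean image; True marks the target object.
               None means object-agnostic picking.
    Returns the index of the highest-scoring eligible candidate,
    or None when no candidate is eligible.
    """
    points_uv = np.asarray(points_uv, dtype=int)
    scores = np.asarray(scores, dtype=float)

    if mask is None:
        # Object-agnostic: every candidate is eligible.
        candidates = np.arange(len(scores))
    else:
        # Target picking: keep candidates whose pixel lies on the mask.
        u, v = points_uv[:, 0], points_uv[:, 1]
        candidates = np.flatnonzero(mask[v, u])

    if candidates.size == 0:
        return None
    return int(candidates[np.argmax(scores[candidates])])
```

With no mask the function returns the globally best candidate. With a mask it returns the best candidate on the target, which may have a lower score than a grasp on a neighboring object.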

Paper Structure (22 sections, 6 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Overview of the Sim-Grasp system. Sim-Grasp is a deep-learning-based system that determines robust 6-DOF two-finger grasp poses in cluttered environments.
  • Figure 2: Example of a 6D Grasping Label Dataset. To better visualize the dataset, only a subset of the candidate grasps is displayed after passing collision checks. Green markers indicate successful grasps with a grasp score of 1, while red markers represent unsuccessful grasps with a grasp score of 0.
  • Figure 3: Parameterization of the Approach-Based Grasp Sampling Schemes. The diagram illustrates the key parameters defining the gripper's positioning and orientation for sampling grasps. The angle $\alpha$ represents the cone angle relative to the surface normal at the sampling point, determining the range of possible approach directions. $G$ denotes the gripper candidates' configuration. $D$ is the standoff distance from the target object. $V$ is the vector representing the gripper's approach direction, and $A$ indicates the in-plane rotation angle around the gripper's approach direction. The triangle mesh of the target object is used to check overlap with the gripper configuration $G$.
  • Figure 4: Sim-Grasp Architecture. The Sim-GraspNet network provides the backbone for the Sim-Grasp multi-modal grasping policies. The green marker represents the 6D grasp pose for the object instance with the highest confidence score. The transparency of the blue markers indicates the confidence score, with higher transparency implying lower confidence and vice versa.
  • Figure 5: Experimental setup with a Fetch robot equipped with an RGB-D camera. The robot picks up objects from the workspace and drops them in the collection bin. We choose 64 household items, with 13 objects in Level 1, 19 objects in Level 2, 21 objects in Level 3, and 11 objects in Level 4.
  • ...and 3 more figures
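
The parameters in the Figure 3 caption ($\alpha$, $D$, $V$, $A$) can be made concrete with a short sketch of approach-based grasp sampling. The caption does not give the sampling distributions, so everything below is an assumption: $\alpha$ is treated as a cone half-angle around an outward surface normal, directions are drawn uniformly over the cone's solid angle, $A$ is drawn uniformly from $[0, 2\pi)$, and the gripper origin sits a distance $D$ back from the sampling point along the approach direction. The function names are hypothetical, and the mesh overlap check on $G$ is omitted.

```python
import numpy as np

def orthonormal_basis(n):
    """Return two unit vectors that are orthogonal to unit vector n."""
    helper = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    t1 = np.cross(n, helper)
    t1 /= np.linalg.norm(t1)
    t2 = np.cross(n, t1)
    return t1, t2

def sample_grasp(point, normal, alpha, standoff, rng):
    """Sample one grasp candidate around a surface point (illustrative only).

    point:    (3,) sampling point on the object surface.
    normal:   (3,) outward surface normal at that point (assumed convention).
    alpha:    cone half-angle in radians (assumed meaning of the cone angle).
    standoff: distance D between gripper origin and sampling point.
    Returns (position, rotation, V, A), where the columns of rotation are
    the gripper x, y axes and the approach direction V.
    """
    point = np.asarray(point, dtype=float)
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    t1, t2 = orthonormal_basis(n)

    # Direction inside the cone, uniform over solid angle.
    cos_theta = rng.uniform(np.cos(alpha), 1.0)
    sin_theta = np.sqrt(1.0 - cos_theta**2)
    phi = rng.uniform(0.0, 2.0 * np.pi)
    d = cos_theta * n + sin_theta * (np.cos(phi) * t1 + np.sin(phi) * t2)

    V = -d                                   # approach points toward the object
    A = rng.uniform(0.0, 2.0 * np.pi)        # in-plane rotation about V
    b1, b2 = orthonormal_basis(V)
    x_axis = np.cos(A) * b1 + np.sin(A) * b2
    y_axis = np.cross(V, x_axis)             # completes a right-handed frame
    rotation = np.column_stack([x_axis, y_axis, V])

    position = point - standoff * V          # back off by D along the approach
    return position, rotation, V, A
```

Each sampled pose would still need a collision check against the object mesh and a physics-based trial before it could receive a grasp score, as described in the Figure 2 and Figure 3 captions.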