Sim-Grasp: Learning 6-DOF Grasp Policies for Cluttered Environments Using a Synthetic Benchmark
Juncheng Li, David J. Cappelleri
TL;DR
Sim-Grasp addresses robust 6-DOF grasping in clutter by learning from a large synthetic benchmark. It combines a 6-DOF grasp network (Sim-GraspNet) with multi-modal policies (object-agnostic, text-prompt, and box-prompt) to enable open-set grasping and target picking, using GroundingDINO and SAM for semantic guidance. The Sim-Grasp-Dataset provides 1,550 objects across 500 cluttered scenes with ~7.9M 6-DOF grasp labels generated via physics-based simulation. The system achieves state-of-the-art performance on both isolated and cluttered tasks (97.14% success on single objects; 87.43% and 83.33% in cluttered scenes of Level 1–2 and Level 3–4 objects, respectively) and demonstrates robust sim-to-real transfer on a Fetch robot. Remaining limitations are transparent and deformable objects, which are difficult to grasp reliably without tactile feedback; this motivates future work on closed-loop sensing and manipulation with tactile transducers and force sensing.
Abstract
In this paper, we present Sim-Grasp, a robust 6-DOF two-finger grasping system that integrates advanced language models for enhanced object manipulation in cluttered environments. We introduce the Sim-Grasp-Dataset, which includes 1,550 objects across 500 scenarios with 7.9 million annotated labels, and develop Sim-GraspNet to generate grasp poses from point clouds. The Sim-Grasp-Policies achieve grasping success rates of 97.14% for single objects and 87.43% and 83.33% for mixed clutter scenarios of Level 1–2 and Level 3–4 objects, respectively. By incorporating language models for target identification through text and box prompts, Sim-Grasp enables both object-agnostic and target picking, pushing the boundaries of intelligent robotic systems.
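The abstract distinguishes object-agnostic picking (take the best grasp anywhere in the scene) from target picking (take the best grasp on the object named by a text or box prompt). The sketch below illustrates one way the two modes could share a single selection step. It assumes that the grasp network returns a scored list of 6-DOF candidates in the camera frame and that the prompt pipeline (for example, GroundingDINO boxes refined by SAM) yields a binary image mask of the target. `GraspCandidate`, `project_to_pixel`, and `select_grasp` are hypothetical names, and the mask filter uses a plain pinhole projection. This is not the paper's published implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]
Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy), hypothetical layout


@dataclass
class GraspCandidate:
    """Hypothetical 6-DOF grasp: gripper center, orientation, and network confidence."""
    position: Vec3   # grasp center in the camera frame (meters)
    rotation: Mat3   # 3x3 gripper orientation (row-major rotation matrix)
    score: float     # predicted grasp quality; higher is better


def project_to_pixel(p: Vec3, intrinsics: Intrinsics) -> Optional[Tuple[int, int]]:
    """Pinhole projection of a camera-frame point to pixel (u, v); None if behind the camera."""
    fx, fy, cx, cy = intrinsics
    x, y, z = p
    if z <= 0:
        return None
    return int(round(fx * x / z + cx)), int(round(fy * y / z + cy))


def select_grasp(candidates: Sequence[GraspCandidate],
                 intrinsics: Intrinsics,
                 target_mask: Optional[List[List[bool]]] = None) -> Optional[GraspCandidate]:
    """Return the highest-scoring grasp, or None if no candidate qualifies.

    target_mask is None -> object-agnostic picking: every candidate is eligible.
    target_mask given   -> target picking: keep only grasps whose center projects
                           onto a True pixel of the H x W mask (indexed mask[v][u]).
    """
    pool = list(candidates)
    if target_mask is not None:
        h = len(target_mask)
        w = len(target_mask[0]) if h else 0
        kept = []
        for g in pool:
            uv = project_to_pixel(g.position, intrinsics)
            if uv is None:
                continue
            u, v = uv
            if 0 <= v < h and 0 <= u < w and target_mask[v][u]:
                kept.append(g)
        pool = kept
    return max(pool, key=lambda g: g.score, default=None)
```

Here, a single projected grasp center stands in for the target check. This is a deliberate simplification: a fuller pipeline would more likely test whether the gripper's contact region overlaps the target mask, or filter the point cloud to the target before running the grasp network.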
