ICGNet: A Unified Approach for Instance-Centric Grasping

René Zurbrügg, Yifan Liu, Francis Engelmann, Suryansh Kumar, Marco Hutter, Vaishakh Patil, Fisher Yu

TL;DR

This paper addresses robust grasping in cluttered environments from a single view by introducing ICGNet, a unified instance-centric framework. The method constructs per-object embeddings from a sparse multi-scale feature volume, refining them via masked cross- and self-attention to enable object-level reconstruction and grasp prediction. It jointly predicts instance segmentation, 3D occupancy, and grasp affordances, supporting target-driven interactions and collision checks. Experiments on synthetic datasets show state-of-the-art performance in packed and piled decluttering, with sim-to-real transfer validated on real robot tasks.

Abstract

Accurate grasping is the key to several robotic tasks including assembly and household robotics. Executing a successful grasp in a cluttered environment requires multiple levels of scene understanding: First, the robot needs to analyze the geometric properties of individual objects to find feasible grasps. These grasps need to be compliant with the local object geometry. Second, for each proposed grasp, the robot needs to reason about the interactions with other objects in the scene. Finally, the robot must compute a collision-free grasp trajectory while taking into account the geometry of the target object. Most grasp detection algorithms directly predict grasp poses in a monolithic fashion, which does not capture the composability of the environment. In this paper, we introduce an end-to-end architecture for object-centric grasping. The method uses pointcloud data from a single arbitrary viewing direction as an input and generates an instance-centric representation for each partially observed object in the scene. This representation is further used for object reconstruction and grasp detection in cluttered table-top scenes. We show the effectiveness of the proposed method by extensively evaluating it against state-of-the-art methods on synthetic datasets, indicating superior performance for grasping and reconstruction. Additionally, we demonstrate real-world applicability by decluttering scenes with varying numbers of objects.

Paper Structure (5 sections, 5 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Overview of our network predictions. Given a single-view pointcloud, we jointly predict instance segmentation masks, collision-free grasp predictions, and reconstructions for each object.
  • Figure 2: Grasp Representation. Left: $\mathbf{c}$ is the contact point of the closed gripper and $\mathbf{n}$ is its estimated surface normal. $\mathbf{a}$ is the approach direction of the gripper. Given the gravity vector $\mathbf{z}$ and surface normal $\mathbf{n}$, $\mathbf{a}$ can be uniquely defined by the approach angle $\alpha$. Top Right: Grasp ambiguity of different grasp representations. For a particular contact point or gripper center, there can be multiple feasible approach directions that result in a successful grasp. Bottom Right: For each contact point, our representation enables the prediction of grasp qualities for different gripper orientations perpendicular to the surface normal.
  • Figure 3: Model Overview. Given an input pointcloud, we voxelize the pointcloud and extract volumetric and surface features at multiple scales using a sparse Minkowski U-Net [choy20194d] and a dense U-Net [cciccek20163d]. The surface features are enriched with volumetric information and treated as tokens with positional encodings based on voxel locations. Masked attention iteratively refines instance queries by cross-attending to extracted sparse tokens. This process allows each latent query to focus on a specific instance and to be classified as "<semantic class>" or "no object". The refined queries condition the task-specific decoders to model the occupancy of each instance directly or to predict grasp affordance scores and gripper widths.
  • Figure 4: Our grasp predictions on simulated, unseen test objects from [downs2022google]. Predicted grasps for "bottle", "can" and "box" in the packed (left) and pile (right) setups. More qualitative examples can be found on the project page.
  • Figure 5: Real-world experimental setup. We use 17 different objects, of which 3 to 6 are placed on a $30$ cm$^3$ workspace.
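
The grasp parameterization described in the Figure 2 caption can be illustrated with a small sketch. The caption only states that the approach direction $\mathbf{a}$ is determined by the gravity vector $\mathbf{z}$, the surface normal $\mathbf{n}$, and the approach angle $\alpha$, and that gripper orientations lie perpendicular to the normal. The construction below is one plausible reading, not the authors' implementation: it assumes that $\alpha = 0$ corresponds to gravity projected onto the plane perpendicular to $\mathbf{n}$, and that $\alpha$ rotates this reference direction about $\mathbf{n}$. The function name `approach_direction`, the default gravity vector, and the rotation sign are all hypothetical choices.

```python
import numpy as np


def approach_direction(n, alpha, z=(0.0, 0.0, -1.0), eps=1e-8):
    """Return a unit approach direction perpendicular to the normal n.

    Assumed convention (not taken from the paper): alpha = 0 points along
    the gravity vector z projected onto the plane perpendicular to n, and
    alpha rotates that reference direction about n (right-hand rule).
    """
    n = np.asarray(n, dtype=float)
    z = np.asarray(z, dtype=float)
    n = n / np.linalg.norm(n)

    # Reference direction: component of gravity orthogonal to the normal.
    ref = z - np.dot(z, n) * n
    ref_norm = np.linalg.norm(ref)
    if ref_norm < eps:
        # The projection vanishes when n is parallel to z, so this
        # particular construction has no unique reference direction.
        raise ValueError("normal is parallel to gravity; reference undefined")
    ref = ref / ref_norm

    # Rodrigues rotation of ref about n; the (n . ref) term is zero
    # because ref is orthogonal to n by construction.
    return np.cos(alpha) * ref + np.sin(alpha) * np.cross(n, ref)


# Candidate gripper orientations for one contact point on a vertical surface.
normal = np.array([1.0, 0.0, 0.0])
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
candidates = np.stack([approach_direction(normal, a) for a in angles])
```

Under this reading, every candidate in `candidates` is a unit vector orthogonal to `normal`, so a network could score a discrete set of angles per contact point instead of regressing a single, possibly ambiguous, approach direction.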