Local Occupancy-Enhanced Object Grasping with Multiple Triplanar Projection

Kangqi Ma, Hao Dong, Yadong Mu

TL;DR

This work addresses robust 6-DoF grasping from a single-view depth input under occlusions by introducing local occupancy completion around candidate grasp points. It predicts voxel occupancy in local neighborhoods using a multi-group tri-plane representation that fuses global scene context with local observations, enabling occupancy-informed grasp pose estimation. The approach jointly optimizes occupancy prediction and grasp pose decoding in an end-to-end framework and demonstrates strong gains on GraspNet-1Billion and in real-robot experiments, including reduced collision risk. The method offers practical benefits for cluttered scenes by reconstructing locally missing geometry and guiding grasp decisions with completed shape information.
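
As a rough illustration of what "voxel occupancy in local neighborhoods" means, the sketch below builds a small cubic voxel grid around one candidate grasp point and marks each voxel as occupied or void depending on whether any surface point falls inside it. The function names, the grid extent (`half_extent`), and the voxel size are assumptions made for illustration only. In the paper these labels are predicted by a learned module, not computed from a complete point set as done here.

```python
from itertools import product

def local_voxel_centers(grasp_point, half_extent=0.04, voxel_size=0.02):
    """Centers of a cubic voxel grid centered on a candidate grasp point.

    half_extent and voxel_size (in meters) are illustrative values, not the paper's.
    """
    n = round(2 * half_extent / voxel_size)  # voxels per axis
    offsets = [-half_extent + (i + 0.5) * voxel_size for i in range(n)]
    return [
        tuple(g + o for g, o in zip(grasp_point, (dx, dy, dz)))
        for dx, dy, dz in product(offsets, repeat=3)
    ]

def occupancy_labels(centers, surface_points, voxel_size=0.02):
    """Label a voxel occupied (1) if any surface point lies inside it, else void (0)."""
    half = voxel_size / 2
    labels = []
    for c in centers:
        inside = any(
            all(abs(p[k] - c[k]) <= half for k in range(3)) for p in surface_points
        )
        labels.append(1 if inside else 0)
    return labels

# Hypothetical usage: one grasp point at the origin and a single surface point.
centers = local_voxel_centers((0.0, 0.0, 0.0))
labels = occupancy_labels(centers, [(0.012, 0.008, 0.011)])
```

With these illustrative settings the grid has 4 voxels per axis (64 in total), and the single surface point occupies exactly one of them.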

Abstract

This paper addresses the challenge of robotic grasping of general objects. As in prior research, the task takes as input a single-view 3D observation (i.e., a point cloud) captured by a depth camera. Crucially, successful grasping depends on a comprehensive understanding of the shapes of the objects in the scene. However, single-view observations often suffer from occlusions (both self-occlusions and inter-object occlusions), which leave gaps in the point clouds, especially in complex cluttered scenes. The resulting incomplete perception of object shape frequently causes grasp failures or inaccurate pose estimates. In this paper, we tackle this issue with a simple yet effective solution: completing grasp-related scene regions through local occupancy prediction. Following prior practice, the proposed model first proposes a number of the most likely grasp points in the scene. Around each grasp point, a module infers whether each voxel in its neighborhood is void or occupied by some object. Importantly, the occupancy map is inferred by fusing both local and global cues. We implement a multi-group tri-plane scheme for efficiently aggregating long-distance contextual information. The model then estimates 6-DoF grasp poses using the local occupancy-enhanced object shape information and returns the top-ranked grasp proposal. Comprehensive experiments on both the large-scale GraspNet-1Billion benchmark and a real robotic arm demonstrate that the proposed method can effectively complete the unobserved parts in cluttered and occluded scenes. Benefiting from the occupancy-enhanced features, our model clearly outperforms competing methods on various performance metrics such as grasping average precision.
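
The tri-plane idea mentioned in the abstract can be sketched as follows: per-point features are projected onto three orthogonal planes, and a feature for any 3D position (including an unobserved one) is gathered by looking up its projection on each plane. The sketch below is a simplified illustration, not the authors' implementation. It assumes scalar features, max-pooling into grid cells, nearest-cell lookup, and "groups" that differ only by a rotation about the z-axis; all names and parameters (`build_triplanes`, `query`, `res`, the rotation choice) are hypothetical.

```python
import math

def rotation_z(angle):
    """3x3 rotation about the z-axis (an assumed way to define each group's frame)."""
    c, s = math.cos(angle), math.sin(angle)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))

def rotate(R, p):
    """Apply rotation matrix R to point p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) for i in range(3))

# Each plane keeps two of the three coordinates.
PLANES = {"xy": (0, 1), "xz": (0, 2), "yz": (1, 2)}

def cell(p, axes, lo, hi, res):
    """Map a 3D point to a 2D grid cell on the plane spanned by `axes`."""
    idx = []
    for a in axes:
        t = (p[a] - lo) / (hi - lo)
        idx.append(min(res - 1, max(0, int(t * res))))
    return tuple(idx)

def build_triplanes(points, feats, R, lo=-1.0, hi=1.0, res=8):
    """Max-pool scalar per-point features onto three planes of one group."""
    planes = {name: {} for name in PLANES}
    for p, f in zip(points, feats):
        q = rotate(R, p)
        for name, axes in PLANES.items():
            key = cell(q, axes, lo, hi, res)
            planes[name][key] = max(planes[name].get(key, f), f)
    return planes

def query(planes_per_group, rotations, x, lo=-1.0, hi=1.0, res=8):
    """Gather one feature per plane per group for position x (0.0 for empty cells)."""
    out = []
    for planes, R in zip(planes_per_group, rotations):
        q = rotate(R, x)
        for name, axes in PLANES.items():
            out.append(planes[name].get(cell(q, axes, lo, hi, res), 0.0))
    return out

# Hypothetical usage: two groups, two observed points, one unobserved query position.
rotations = [rotation_z(0.0), rotation_z(math.pi / 4)]
points = [(0.1, 0.2, 0.3), (0.1, 0.2, -0.5)]
feats = [1.0, 2.0]
groups = [build_triplanes(points, feats, R) for R in rotations]
vec = query(groups, rotations, (0.1, 0.2, 0.9))
```

Here the query position was never observed, yet it still receives a non-zero feature from the `xy` plane of each group because observed points share its projection there. This is the sense in which planar projections can carry long-distance context to a local query; a learned model would replace the scalar max-pooling and nearest-cell lookup with learned feature maps and interpolation.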

Paper Structure

This paper contains 10 sections, 8 equations, 8 figures, and 8 tables.

Figures (8)

  • Figure 1: Illustration of local occupancy-enhanced object grasping. Top: The gaps in point clouds heavily affect the accuracy of estimated grasp poses, leading to a failure under self-occlusion in this case. Bottom: In our proposed method, grasp pose is estimated using local occupancy-enhanced features, which essentially infer the complete object shape locally around the grasp point by fusing global / local cues.
  • Figure 2: Model architecture of the proposed local occupancy-enhanced object grasping. It first identifies a number of local occupancy regions of interest. Then the multi-group tri-plane scheme aggregates the scene context for local occupancy estimation. Finally, the occupancy-enhanced local shape feature in each grasp region is extracted by fusing the information of both explicit voxels and implicit queried features, and is decoded into grasp poses.
  • Figure 3: Settings of real-world experiments. Left: the configuration of the grasping system. Middle: objects used for grasping. Right: an example of the grasping scene.
  • Figure 4: Comparison in the real-world test. Left: a tiny bottle is directly below the camera, so the observation is severely self-occluded. Middle: lacking the complete shape, the grasp pose estimated by the baseline collides with the bottle. Right: our method reconstructs the complete shape of the grasp region and succeeds in grasping.
  • Figure 5: Visualization of predicted local occupancy and grasp poses on the GraspNet-1Billion benchmark. Some grasp poses predicted by the baseline without occupancy enhancement fail to grasp the target object (i.e., the yellow bottle).
  • ...and 3 more figures