LAC-Net: Linear-Fusion Attention-Guided Convolutional Network for Accurate Robotic Grasping Under the Occlusion

Jinyu Zhang; Yongchong Gu; Jianxiong Gao; Haitao Lin; Qiang Sun; Xinwei Sun; Xiangyang Xue; Yanwei Fu

TL;DR

The paper tackles amodal segmentation for robotic grasping under occlusion by proposing LAC-Net, a framework that linearly fuses RGB-D features and uses the visible mask as an attention guide to complete the amodal mask. It combines an off-the-shelf visible-mask network with an RGB-D amodal completion module that employs dual backbones and an attention-guided head, producing a robust amodal mask M_a for precise grasping. The authors report state-of-the-art performance on UOAIS-Sim and OSD-amodal and validate real-world feasibility with a Kinova robot, achieving substantially higher grasp success, particularly for center-based top grasps. The work advances practical occlusion-aware manipulation, with potential applications in cluttered environments such as removing debris buried in sand; the authors name beach cleaning as a direction for future work.

Abstract

This paper addresses the challenge of perceiving complete object shapes from visual input. While prior studies have demonstrated encouraging results in segmenting the visible parts of objects within a scene, amodal segmentation has the potential to allow robots to infer the occluded parts of objects as well. To this end, this paper introduces a new framework that explores amodal segmentation for robotic grasping in cluttered scenes, thus greatly enhancing robotic grasping abilities. First, we use a conventional segmentation algorithm to detect the visible segments of the target object, which provide shape priors for completing the full object mask. In particular, to explore how to utilize semantic features from RGB images and geometric information from depth images, we propose a Linear-fusion Attention-guided Convolutional Network (LAC-Net). LAC-Net uses a linear-fusion strategy to fuse this cross-modal data effectively, and then uses the prior visible mask as an attention map to guide the network to focus on target feature locations when recovering the complete mask. Using the amodal mask of the target object makes it possible to select more accurate and robust grasp points than relying solely on the visible segments. Results on different datasets show that our method achieves state-of-the-art performance. Furthermore, robot experiments validate the feasibility and robustness of this method in the real world. Our code and demonstrations are available on the project page: https://jrryzh.github.io/LAC-Net.
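The two ideas named in the abstract, linear fusion of RGB and depth features and attention guided by the visible mask, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the convex-combination form of the fusion, the `alpha` and `gain` parameters, and the function names `linear_fuse` and `mask_guided_attention` are all assumptions made for illustration, and the random arrays merely stand in for backbone outputs.

```python
import numpy as np


def linear_fuse(rgb_feat, depth_feat, alpha=0.5):
    """Convex combination of two feature maps of identical shape (C, H, W).

    Assumption: "linear fusion" is modelled here as a fixed weighted sum.
    """
    if rgb_feat.shape != depth_feat.shape:
        raise ValueError("feature maps must have the same shape")
    return alpha * rgb_feat + (1.0 - alpha) * depth_feat


def mask_guided_attention(feat, visible_mask, gain=1.0):
    """Scale features up at locations covered by the visible mask.

    feat: (C, H, W) feature map; visible_mask: (H, W) binary array.
    Assumption: attention is a simple multiplicative re-weighting.
    """
    attention = 1.0 + gain * visible_mask.astype(feat.dtype)  # shape (H, W)
    return feat * attention[None, :, :]  # broadcast over channels


# Toy inputs standing in for the outputs of the RGB and depth backbones.
rng = np.random.default_rng(0)
rgb_feat = rng.standard_normal((8, 4, 4))
depth_feat = rng.standard_normal((8, 4, 4))

# Hypothetical visible mask covering the central 2x2 region.
visible_mask = np.zeros((4, 4), dtype=np.uint8)
visible_mask[1:3, 1:3] = 1

fused = linear_fuse(rgb_feat, depth_feat, alpha=0.6)
attended = mask_guided_attention(fused, visible_mask)
```

With `gain=1.0`, features inside the visible mask are doubled and features outside it are left unchanged, so a downstream head would see the target region emphasised.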

Paper Structure (15 sections, 6 figures, 5 tables)

This paper contains 15 sections, 6 figures, 5 tables.

INTRODUCTION
RELATED WORKS
Amodal Instance Segmentation
Target-oriented Grasping in Clutter
METHOD
Problem Definition
Framework
Amodal Mask Completion
Grasp Point Generation
EXPERIMENT
Amodal segmentation
Visible mask
RGB-D Fusion
Real-world amodal grasping
CONCLUSIONS

Figures (6)

Figure 1: Comparison between conventional instance segmentation and amodal segmentation for robotic grasping. We simulate a scene in which garbage is buried in sand. Unlike conventional instance segmentation methods, which predict only the visible mask for selecting grasp points, our approach leverages the amodal mask to identify more robust grasp points for the final grasping action.
Figure 2: The workflow of our proposed amodal instance segmentation method for robotic grasping. Given a prompt describing the target object within the scene, we employ Grounding DINO (liu2023grounding) in conjunction with the Segment Anything Model (kirillov2023segany) to localize the visible portion of the target object. The visible mask is then fed into the coarse-to-fine amodal segmentation module to estimate the complete shape mask of the object. Finally, the robot performs a top-grasp action on the target object using the derived full mask together with the depth image.
Figure 3: Test object collection used in our robotic experiment. We choose 15 instances with different shapes for testing.
Figure 4: Qualitative results of amodal segmentation in real-world scenes. We present the original RGB image, the visible mask estimated by SAM, and the amodal masks produced by UOAIS-net and by our method.
Figure 5: Comparison of robotic grasping results between the baseline method and our method in single-object scenes. The baseline UOAIS-net often grasps the edge portions of the target object, primarily because of its limitations in recovering the amodal mask. In contrast, our approach tends to grasp the object's center, because it can estimate an approximate full mask of the target object, which yields a higher overall success rate.
...and 1 more figure
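The captions for Figures 1 and 5 contrast grasp points chosen from the visible mask with those chosen from the amodal mask. A minimal sketch of why the two differ is given below. It assumes the grasp point is simply the mask centroid, which is a simplification and not necessarily the paper's Grasp Point Generation step; the helper names `mask_centroid` and `top_grasp_point` and the toy scene are hypothetical.

```python
import numpy as np


def mask_centroid(mask):
    """Return the (row, col) centroid of a binary mask, rounded to pixels."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        raise ValueError("mask is empty")
    return int(round(float(rows.mean()))), int(round(float(cols.mean())))


def top_grasp_point(mask, depth):
    """Pair the mask centroid with the depth value at that pixel.

    Note: if the centroid lies in an occluded region, the measured depth
    belongs to the occluder, so a real system would need to handle that.
    """
    r, c = mask_centroid(mask)
    return r, c, float(depth[r, c])


# Toy scene: the object spans rows 3..7 and columns 2..8 of a 10x10 image.
amodal = np.zeros((10, 10), dtype=bool)
amodal[3:8, 2:9] = True

# Columns 5..8 are occluded, so only columns 2..4 are visible.
visible = amodal.copy()
visible[:, 5:] = False

depth = np.full((10, 10), 0.5)  # flat placeholder depth map, in metres

vis_pt = top_grasp_point(visible, depth)
amo_pt = top_grasp_point(amodal, depth)
print("visible-mask grasp pixel:", vis_pt[:2])
print("amodal-mask grasp pixel:", amo_pt[:2])
```

In this toy scene the visible-mask centroid falls at column 3, near the exposed edge of the object, while the amodal-mask centroid falls at column 5, the true center, which mirrors the edge-versus-center behaviour described in Figure 5.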
