ALDM-Grasping: Diffusion-aided Zero-Shot Sim-to-Real Transfer for Robot Grasping

Yiwei Li, Zihao Wu, Huaqin Zhao, Tianze Yang, Zhengliang Liu, Peng Shu, Jin Sun, Ramviyas Parasuraman, Tianming Liu

TL;DR

Experimental results indicate that this framework outperforms existing models in both success rate and adaptability to new environments through improvements in the accuracy and reliability of visual grasping actions under a variety of conditions.

Abstract

To tackle the "reality gap" encountered in Sim-to-Real transfer, this study proposes a diffusion-based framework that minimizes inconsistencies in grasping actions between simulated and real environments. The process begins by training an adversarial supervision layout-to-image diffusion model (ALDM). The ALDM is then used to enhance the simulation environment, rendering it with photorealistic fidelity and thereby improving training for robotic grasping tasks. Experimental results indicate that this framework outperforms existing models in both success rate and adaptability to new environments through improvements in the accuracy and reliability of visual grasping actions under a variety of conditions. Specifically, it achieves a 75% success rate in grasping tasks against plain backgrounds and maintains a 65% success rate in more complex scenarios. These results demonstrate that the framework excels at generating controlled image content based on text descriptions, identifying object grasp points, and performing zero-shot learning in complex, unseen scenarios.

Paper Structure (18 sections, 4 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: ALDM performs well in zero-shot image generation and style transfer. The image generation model used in this article contains data of various types and can generate objects in various robotic application scenarios. The generated image is not only close to a realistic-style scene but also keeps the number and position of key objects consistent with the original image. This offers considerable potential to provide more diverse, realistic, and accurate training samples for robot action planning applications.
  • Figure 2: The full pipeline of the diffusion-model-based grasping robot. (a) The generation procedure for realistic-style images. (b) The robot agent network designed for object detection and grasping.
  • Figure 3: Experimental results in image generation. The Gazebo column shows five simulation scenarios constructed in the Gazebo framework and designed to match the physical layout of the laboratory environment. The Segmentation column shows the object segmentation results derived from the simulated settings; these serve as the inputs to the layout-to-image diffusion model, with regions delineated by predefined labels. The ControlNet column shows images with incorrect content and inaccurate object positions. The ALDM column shows the output generated by the image synthesis model. Notably, the ALDM images not only localize the target objects precisely but also resemble the actual environment in style, demonstrating the model's efficacy in bridging the gap between the style of the simulated images and the laboratory environment.
  • Figure 4: The object detection results of the physical robot. The first panel shows detection performance against a simple background, and the second panel against a complex background. The target objects are successfully detected even though the background contains many interfering objects.
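
The Figure 3 caption says that segmentation results with predefined labels serve as the input to the layout-to-image model. The following is a minimal sketch of how such a label map could be turned into a per-class layout, assuming a one-hot encoding with one channel per label. The encoding, the label names in LABELS, and the function segmentation_to_layout are illustrative assumptions, not the paper's implementation.

```python
from typing import Dict, List

# Hypothetical label ids; the paper's actual label set is not given here.
LABELS: Dict[str, int] = {"background": 0, "table": 1, "target": 2}


def segmentation_to_layout(
    seg: List[List[int]], num_classes: int
) -> List[List[List[float]]]:
    """Turn an H x W map of integer labels into a num_classes x H x W one-hot layout."""
    # Reject label ids that have no channel, rather than silently dropping them.
    for row in seg:
        for label in row:
            if not 0 <= label < num_classes:
                raise ValueError(f"label id {label} is out of range")
    # Build one binary channel per class: 1.0 where the pixel has that label.
    return [
        [[1.0 if label == class_id else 0.0 for label in row] for row in seg]
        for class_id in range(num_classes)
    ]


# A tiny 2 x 3 label map: background on the left, table above, target below.
seg = [
    [0, 1, 1],
    [0, 2, 2],
]
layout = segmentation_to_layout(seg, len(LABELS))
```

Each pixel is active in exactly one channel, so the layout keeps the number and position of labeled objects explicit. This is the property the captions emphasize when comparing the generated images with the simulated scenes.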