ALDM-Grasping: Diffusion-aided Zero-Shot Sim-to-Real Transfer for Robot Grasping

Yiwei Li; Zihao Wu; Huaqin Zhao; Tianze Yang; Zhengliang Liu; Peng Shu; Jin Sun; Ramviyas Parasuraman; Tianming Liu

TL;DR

Experimental results indicate that this framework outperforms existing models in both success rate and adaptability to new environments through improvements in the accuracy and reliability of visual grasping actions under a variety of conditions.

Abstract

To tackle the "reality gap" encountered in Sim-to-Real transfer, this study proposes a diffusion-based framework that minimizes inconsistencies in grasping actions between simulation settings and real environments. The process begins by training an adversarial supervision layout-to-image diffusion model (ALDM). The ALDM is then used to enhance the simulation environment, rendering it with photorealistic fidelity and thereby improving training for robotic grasping tasks. Experimental results indicate that this framework outperforms existing models in both success rate and adaptability to new environments through improvements in the accuracy and reliability of visual grasping actions under a variety of conditions. Specifically, it achieves a 75% success rate in grasping tasks against plain backgrounds and maintains a 65% success rate in more complex scenarios. These results suggest that the framework excels at generating controlled image content based on text descriptions, identifying object grasp points, and performing zero-shot learning in complex, unseen scenarios.

Paper Structure (18 sections, 4 equations, 4 figures, 2 tables)

INTRODUCTION
RELATED WORK
Visual Grasping
Image Generation Models for Bridging Reality Gap
GANs Models
Diffusion Models
Robotic Application
METHODS
Robot Training and Robotic Control Pipeline
Layout-to-Image Diffusion Model
Object Detection and Grasping
EVALUATIONS
Datasets and Simulation Environment
The training datasets of the image generation model
Experimental Dataset
...and 3 more sections

Figures (4)

Figure 1: ALDM performs well in zero-shot image generation and style transfer. The image generation model used in this article draws on data of various types and can generate objects for a range of robotic application scenarios. The generated image not only resembles a realistic scene but also keeps the number and position of key objects consistent with the original image. This offers considerable potential for providing more diverse, realistic, and accurate training samples for robot action planning applications.
Figure 2: The whole pipeline of the diffusion-model-based grasping robot. a) The generation procedures for realistic style images. b) The robot agent network designed for the task of object detection and grasping.
Figure 3: Experimental results for image generation. The Gazebo column shows five simulation scenarios constructed in the Gazebo framework and designed to match the physical layout of the laboratory environment. The Segmentation column shows the object segmentation results derived from the simulated scenes; these serve as the inputs to the layout-to-image diffusion model and are delineated through predefined labels. The ControlNet column shows images with incorrect content and inaccurate object positions. The ALDM column shows the output generated by the image synthesis model. Notably, the ALDM images not only localize the target objects precisely but also resemble the style of the actual environment, demonstrating the model's efficacy in bridging the gap between the style of the simulated images and the laboratory environment.
Figure 4: Object detection results on the physical robot. The first panel shows detection performance against a simple background, and the second panel shows detection performance against a complex background. The target objects are detected successfully even though the background contains many distracting objects.
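
The captions above describe segmentation maps with predefined labels being fed to the layout-to-image model, and generated images that keep key objects in the same positions. The following is a minimal sketch of one common way such a label map could be prepared and checked, not the authors' implementation. The label names and ids in `LABELS`, the one-hot encoding, and the helper functions `one_hot_layout` and `bounding_box` are all assumptions made for illustration; the paper's actual label set and input encoding are not given in this section.

```python
# Hypothetical label ids; the paper's real label set is not specified here.
LABELS = {"background": 0, "table": 1, "target_object": 2}


def one_hot_layout(label_map, num_classes):
    """Turn an H x W grid of integer class ids into num_classes binary H x W masks."""
    for row in label_map:
        for class_id in row:
            if not 0 <= class_id < num_classes:
                raise ValueError(f"class id {class_id} is outside 0..{num_classes - 1}")
    return [
        [[1 if class_id == c else 0 for class_id in row] for row in label_map]
        for c in range(num_classes)
    ]


def bounding_box(mask):
    """Return (row_min, col_min, row_max, col_max) of the nonzero pixels, or None."""
    coords = [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value
    ]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))


# Toy 3 x 4 segmentation map standing in for a rendered simulation frame.
segmentation = [
    [0, 0, 0, 0],
    [0, 2, 2, 0],
    [1, 1, 1, 1],
]

layout = one_hot_layout(segmentation, num_classes=len(LABELS))
target_box = bounding_box(layout[LABELS["target_object"]])
```

Comparing `target_box` with a box detected in the generated image would be one simple way to check that an object's position survived the style transfer, under the same assumptions.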
