RetinaGAN: An Object-aware Approach to Sim-to-Real Transfer

Daniel Ho, Kanishka Rao, Zhuo Xu, Eric Jang, Mohi Khansari, Yunfei Bai

TL;DR

RetinaGAN is an object-aware sim-to-real adaptation framework that enforces perception consistency by coupling a frozen object detector with CycleGAN-style image translation. Its perception loss, L_prcp, built on a novel Focal Consistency Loss, preserves object structure and texture while translating simulated scenes into realistic images for policy training. Across grasping, pushing, and door-opening tasks, RetinaGAN improves data efficiency: it reaches high real-world performance with limited real data and transfers to a related task without additional real data. The approach generalizes across domains, reusing a single detector across tasks and performing on par with the state of the art in several setups, including ensemble variants for imitation learning. The work points to a practical way to reduce real-world data collection costs in robotic learning: decouple perception from task-specific training and rely on pre-trained detectors for semantic consistency.

Abstract

The success of deep reinforcement learning (RL) and imitation learning (IL) in vision-based robotic manipulation typically hinges on the expense of large-scale data collection. With simulation, data to train a policy can be collected efficiently at scale, but the visual gap between sim and real makes deployment in the real world difficult. We introduce RetinaGAN, a generative adversarial network (GAN) approach to adapt simulated images to realistic ones with object-detection consistency. RetinaGAN is trained in an unsupervised manner without task loss dependencies, and preserves general object structure and texture in adapted images. We evaluate our method on three real-world tasks: grasping, pushing, and door opening. RetinaGAN improves upon the performance of prior sim-to-real methods for RL-based object instance grasping and continues to be effective even in the limited data regime. When applied to a pushing task in a similar visual domain, RetinaGAN demonstrates transfer with no additional real data requirements. We also show our method bridges the visual gap for a novel door-opening task using imitation learning in a new visual domain. Visit the project website at https://retinagan.github.io/

Paper Structure

This paper contains 21 sections, 7 equations, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of RetinaGAN pipeline. Left: Train RetinaGAN using pre-trained perception model to create a sim-to-real model. Right: Train the behavior policy model using the sim-to-real generated images. This policy can later be deployed in real.
  • Figure 2: Sim and real perception data used to train EfficientDet focuses on scenes of disposable objects encountered in recycling stations. The real dataset includes 44,000 such labeled images and 37,000 images of objects on desks. The simulated dataset includes 625,000 total images.
  • Figure 3: Diagram of RetinaGAN stages. The simulated image (top left) is transformed by the sim-to-real generator and subsequently by the real-to-sim generator. The perception loss enforces consistency on object detections from each image. The same pipeline occurs for the real image branch at the bottom.
  • Figure 4: Diagram of perception consistency loss computation. An EfficientDet object detector predicts boxes and classes. Consistency of predictions between images is captured by losses similar to those in object detection training.
  • Figure 5: Sampled, unpaired images for the grasping task at various scales translated with either the sim-to-real (left) or real-to-sim (right) generator. Compared to other methods, the sim-to-real RetinaGAN consistently preserves object textures and better reconstructs real features. The real-to-sim RetinaGAN is able to preserve all object structure in cluttered scenes, and it correctly translates details of the robot gripper and floor.
  • ...and 4 more figures
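
The captions of Figures 3 and 4 describe a perception loss that compares a frozen detector's predictions on the original image, its translated version, and the cycled reconstruction. The following is a minimal sketch of that idea, not the paper's implementation. It makes several assumptions: images and detector outputs are toy flat lists of floats, the consistency term is a plain mean squared difference standing in for the Focal Consistency Loss (whose definition is not given here), and the three pairwise terms are simply summed with equal weight. All function names are hypothetical.

```python
from typing import Callable, List, Sequence

# Toy stand-ins (assumptions): an image is a flat list of pixel values and a
# detector returns a flat list of numbers summarizing boxes and class scores.
Image = List[float]
Detections = List[float]


def consistency(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared difference between two detector outputs.

    Placeholder for the Focal Consistency Loss, which is not reproduced here.
    """
    if len(a) != len(b):
        raise ValueError("detector outputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def perception_loss(
    image: Image,
    forward_gen: Callable[[Image], Image],
    backward_gen: Callable[[Image], Image],
    detector: Callable[[Image], Detections],
) -> float:
    """Sum pairwise detection consistency over one branch of the cycle.

    For the sim branch, forward_gen is sim-to-real and backward_gen is
    real-to-sim; the real branch swaps the two generators. The detector is
    treated as frozen: it is only called, never updated.
    """
    translated = forward_gen(image)
    cycled = backward_gen(translated)
    d_orig = detector(image)
    d_trans = detector(translated)
    d_cyc = detector(cycled)
    return (
        consistency(d_orig, d_trans)
        + consistency(d_trans, d_cyc)
        + consistency(d_orig, d_cyc)
    )


# Toy usage: a "detector" that reports mean and max intensity, and a pair of
# "generators" that shift brightness up and back down.
def toy_detector(img: Image) -> Detections:
    return [sum(img) / len(img), max(img)]


def brighten(img: Image) -> Image:
    return [p + 0.5 for p in img]


def darken(img: Image) -> Image:
    return [p - 0.5 for p in img]


def identity(img: Image) -> Image:
    return list(img)


sim_image = [0.0, 1.0]
loss_shifted = perception_loss(sim_image, brighten, darken, toy_detector)
loss_identity = perception_loss(sim_image, identity, identity, toy_detector)
```

In this toy setup the brightness shift changes what the detector reports on the translated image, so the loss is positive, while generators that leave detections untouched incur zero loss. A real pipeline would replace the toy callables with the two generators and a pre-trained detector such as EfficientDet, and would evaluate the function once per branch.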