Learning Visual Parkour from Generated Images

Alan Yu, Ge Yang, Ran Choi, Yajvan Ravan, John Leonard, Phillip Isola

TL;DR

This work proposes a way to use generative models to synthesize diverse and physically accurate image sequences of the scene from the robot's ego-centric perspective and presents demonstrations of zero-shot transfer to the RGB-only observations of the real world on a robot equipped with a low-cost, off-the-shelf color camera.

Abstract

Fast and accurate physics simulation is an essential component of robot learning, where robots can explore failure scenarios that are difficult to produce in the real world and learn from unlimited on-policy data. Yet, it remains challenging to incorporate RGB-color perception into the sim-to-real pipeline that matches the real world in its richness and realism. In this work, we train a robot dog in simulation for visual parkour. We propose a way to use generative models to synthesize diverse and physically accurate image sequences of the scene from the robot's ego-centric perspective. We present demonstrations of zero-shot transfer to the RGB-only observations of the real world on a robot equipped with a low-cost, off-the-shelf color camera. Website: https://lucidsim.github.io

Paper Structure

This paper contains 34 sections, 19 figures, 6 tables.

Figures (19)

  • Figure 1: Learning a real-world policy from generated images. Left: we generate diverse and on-policy visual data by combining structured image prompts with geometric and semantic control from an underlying physics simulator. Right: the policy is sufficiently robust to transfer to a variety of challenging terrains in the real world, despite never having seen real data during training.
  • Figure 2: The LucidSim graphics pipeline. We use the same parameterized terrain geometry as \cite{cheng2023parkour}. We use MuJoCo to simulate the physics, and render semantic masks and the depth image that are then fed into a ControlNet trained with MiDaS depth maps. The generated image is then combined with the dense optical flow to generate short videos via Dreams In Motion (DIM, see Sec. \ref{sec:geometry-physics-guidance}).
  • Figure 3: We solicit batches of $20\sim30$ image prompts in JSON format from ChatGPT. Each task requires $\sim10^3$ generated prompts.
  • Figure 4: CLIP embeddings of images generated by three sets of meta-prompts. Images from the same prompt ($\color{red}\circ$) are not diverse.
  • Figure 5: LucidSim image samples from the stairs environment. Top row: images generated from different prompts produced by the same meta prompt; bottom row: different meta prompts.
  • ...and 14 more figures
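
As a rough illustration of the scale implied by the Figure 3 caption, the following sketch computes how many batched requests would be needed to collect the prompts for one task. It assumes exactly $10^3$ prompts per task and that every returned prompt is kept; the caption gives only approximate figures, and the function name `batches_needed` is hypothetical, not part of the authors' code.

```python
import math


def batches_needed(total_prompts: int, batch_size: int) -> int:
    """Requests needed to collect total_prompts when each request returns batch_size prompts."""
    if total_prompts < 0 or batch_size <= 0:
        raise ValueError("total_prompts must be >= 0 and batch_size must be > 0")
    return math.ceil(total_prompts / batch_size)


# Assumption: exactly 1000 prompts per task (the caption says roughly 10^3).
TOTAL_PROMPTS = 1000

fewest = batches_needed(TOTAL_PROMPTS, 30)  # largest batch size in the caption
most = batches_needed(TOTAL_PROMPTS, 20)    # smallest batch size in the caption
print(f"between {fewest} and {most} batches per task")
```

Under these assumptions, one task needs between 34 and 50 requests; discarding duplicate or unusable prompts would raise that count.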