Overcoming the Sim-to-Real Gap: Leveraging Simulation to Learn to Explore for Real-World RL

Andrew Wagenmaker, Kevin Huang, Liyiming Ke, Byron Boots, Kevin Jamieson, Abhishek Gupta

TL;DR

This work shows that in many regimes where direct sim2real transfer fails, the simulator can still be used to learn a set of exploratory policies that enable efficient exploration in the real world. To the best of the authors' knowledge, this is the first evidence that simulation transfer yields a provable gain in reinforcement learning in settings where direct sim2real transfer fails.

Abstract

In order to mitigate the sample complexity of real-world reinforcement learning, common practice is to first train a policy in a simulator where samples are cheap, and then deploy this policy in the real world, with the hope that it generalizes effectively. Such \emph{direct sim2real} transfer is not guaranteed to succeed, however, and in cases where it fails, it is unclear how to best utilize the simulator. In this work, we show that in many regimes, while direct sim2real transfer may fail, we can utilize the simulator to learn a set of \emph{exploratory} policies which enable efficient exploration in the real world. In particular, in the setting of low-rank MDPs, we show that coupling these exploratory policies with simple, practical approaches -- least-squares regression oracles and naive randomized exploration -- yields a polynomial sample complexity in the real world, an exponential improvement over direct sim2real transfer, or learning without access to a simulator. To the best of our knowledge, this is the first evidence that simulation transfer yields a provable gain in reinforcement learning in settings where direct sim2real transfer fails. We validate our theoretical results on several realistic robotic simulators and a real-world robotic sim2real task, demonstrating that transferring exploratory policies can yield substantial gains in practice as well.
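The exponential-versus-polynomial gap described in the abstract can be illustrated with a small back-of-the-envelope calculation. The sketch below is a toy, not the paper's construction or algorithm: it assumes a hypothetical "combination lock" with `horizon` steps and `num_actions` actions per step, where exactly one action per step advances. The function names, the parameter `num_wrong_steps` (steps at which a sim-trained exploratory policy picks the wrong action in the real system), and the randomization rate `eps` are all assumptions introduced for illustration.

```python
# Toy illustration only (hypothetical setup, not taken from the paper):
# a combination lock with `horizon` steps, where exactly one of
# `num_actions` actions advances at each step.


def p_uniform_random(horizon: int, num_actions: int) -> float:
    """Per-episode chance that uniformly random actions open the whole lock."""
    return (1.0 / num_actions) ** horizon


def p_guided(horizon: int, num_actions: int, num_wrong_steps: int, eps: float) -> float:
    """Per-episode chance of opening the lock when following an exploratory
    policy that is right at all but `num_wrong_steps` steps, with each action
    replaced by a uniformly random one with probability `eps`."""
    # Policy is right: we advance unless randomization picks a wrong action.
    p_right_step = (1.0 - eps) + eps / num_actions
    # Policy is wrong: we advance only if randomization picks the right action.
    p_wrong_step = eps / num_actions
    return p_right_step ** (horizon - num_wrong_steps) * p_wrong_step ** num_wrong_steps


def expected_episodes(p_success: float) -> float:
    """Mean number of episodes until the first success (geometric distribution)."""
    return 1.0 / p_success


if __name__ == "__main__":
    H, A = 20, 2
    print("uniform random:", expected_episodes(p_uniform_random(H, A)))
    print("guided + eps-random:", expected_episodes(p_guided(H, A, num_wrong_steps=1, eps=1.0 / H)))
```

In this toy, uniform random exploration needs a number of episodes that grows like `num_actions ** horizon`, whereas a mostly correct exploratory policy combined with a small amount of random action noise (here `eps = 1 / horizon`, an arbitrary choice) needs far fewer. The paper's actual guarantees concern low-rank MDPs and are stated in its theorems, not in this sketch.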

Paper Structure

This paper contains 52 sections, 25 theorems, 181 equations, 11 figures, 2 tables, 6 algorithms.

Key Result

Proposition 1

For any $H > 1$, $\zeta \in [0,1]$, and $c \le 1/6$, there exist some $\mathcal{M}^{\mathsf{real},1}$ and $\mathcal{M}^{\mathsf{real},2}$ such that both $\mathcal{M}^{\mathsf{real},1}$ and $\mathcal{M}^{\mathsf{real},2}$ satisfy asm:mdp_structure, asm:bellman_completeness, and unless $T \ge \Omega(2^{

Figures (11)

  • Figure 1: Left: Overview of our approach compared to standard $\mathsf{sim}\mathsf{2}\mathsf{real}$ transfer on the puck pushing task. Standard $\mathsf{sim}\mathsf{2}\mathsf{real}$ transfer first trains a policy to solve the goal task in sim and then transfers this policy to real. This policy may fail to solve the task in real due to the $\mathsf{sim}\mathsf{2}\mathsf{real}$ gap, and furthermore may not provide sufficient data coverage to successfully learn a policy that does solve the goal task in real. In contrast, our approach trains a set of exploratory policies in sim which achieve high-coverage data when deployed in real, even if they are unable to solve the task zero-shot. This high-coverage data can then be used to successfully learn a policy that solves the goal task in real. Right: Quantitative results of running our approach on the puck pushing task illustrated on the left, compared to standard $\mathsf{sim}\mathsf{2}\mathsf{real}$ transfer. Over 6 real-world trials, our approach solves the task 6/6 times while standard $\mathsf{sim}\mathsf{2}\mathsf{real}$ transfer solves the task 0/6 times.
  • Figure 2: Illustration of Didactic Example (\ref{prop:direct_transfer_subopt})
  • Figure 3: TychoEnv Reach Task Setup
  • Figure 4: Franka Hammering Task Setup
  • Figure 5: Results on $\mathsf{sim2sim}$ Combination Lock Example
  • ...and 6 more figures

Theorems & Definitions (48)

  • Definition 3.1: Low-Rank MDP
  • Definition 3.2: PAC Reinforcement Learning
  • Proposition 1
  • Proposition 2: Simulation Lemma
  • Proposition 3
  • Theorem 1
  • Proposition 4
  • Proposition 5
  • Remark 4.1: Computational Efficiency
  • Lemma A.1
  • ...and 38 more
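
For orientation, the low-rank MDP condition named in Definition 3.1 is usually written as below. This is the standard form from the literature, not a copy of the paper's statement: the symbols $\phi_h$, $\mu_h$, and $d$ are assumptions here, and the paper's exact definition (for example, its normalization conditions) may differ.

```latex
% Standard low-rank MDP factorization (assumed notation, horizon H, dimension d)
P_h(s' \mid s, a) = \langle \phi_h(s, a), \mu_h(s') \rangle,
\qquad \phi_h : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d,
\quad \mu_h : \mathcal{S} \to \mathbb{R}^d,
\quad h = 1, \dots, H.
```

Under this factorization the transition kernel at each step has rank at most $d$, which is the structure that makes regression-based methods tractable in this setting.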