RLIF: Interactive Imitation Learning as Reinforcement Learning

Jianlan Luo, Perry Dong, Yuexiang Zhai, Yi Ma, Sergey Levine

TL;DR

This paper explores how off-policy reinforcement learning can improve performance under assumptions that are similar to, but potentially more practical than, those of interactive imitation learning, and proposes a method that runs reinforcement learning with the user intervention signals themselves as rewards.

Abstract

Although reinforcement learning methods offer a powerful framework for automatic skill acquisition, for practical learning-based control problems in domains such as robotics, imitation learning often provides a more convenient and accessible alternative. In particular, an interactive imitation learning method such as DAgger, which queries a near-optimal expert to intervene online to collect correction data for addressing the distributional shift challenges that afflict naïve behavioral cloning, can enjoy good performance both in theory and practice without requiring manually specified reward functions and other components of full reinforcement learning methods. In this paper, we explore how off-policy reinforcement learning can enable improved performance under assumptions that are similar to, but potentially even more practical than, those of interactive imitation learning. Our proposed method uses reinforcement learning with user intervention signals themselves as rewards. This relaxes the assumption that intervening experts in interactive imitation learning should be near-optimal and enables the algorithm to learn behaviors that improve over the potentially suboptimal human expert. We also provide a unified framework to analyze our RL method and DAgger, for which we present an asymptotic analysis of the suboptimality gap for both methods as well as a non-asymptotic sample complexity bound for our method. We then evaluate our method on challenging high-dimensional continuous control simulation benchmarks as well as real-world robotic vision-based manipulation tasks. The results show that it strongly outperforms DAgger-like approaches across the different tasks, especially when the intervening experts are suboptimal. Code and videos can be found on the project website: https://rlif-page.github.io
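
To make the idea of "intervention signals as rewards" concrete, here is a minimal Python sketch of one possible labeling scheme. It is an illustration under stated assumptions, not the authors' implementation: the function name label_intervention_rewards, the boolean per-step flags, and the specific values (-1 for the step immediately before the expert takes over, 0 elsewhere) are all assumed here and are not given in the abstract.

```python
from typing import List, Sequence


def label_intervention_rewards(intervened: Sequence[bool]) -> List[float]:
    """Turn per-step intervention flags into per-step rewards.

    Assumption (not stated in the abstract): intervened[t] is True when the
    human expert, rather than the learned policy, controlled step t. The step
    right before an intervention begins gets reward -1; all others get 0.
    """
    rewards = [0.0] * len(intervened)
    for t in range(1, len(intervened)):
        # An intervention starts at t if the expert acts now but did not at t-1.
        if intervened[t] and not intervened[t - 1]:
            # Penalize the policy's own action that preceded the takeover.
            rewards[t - 1] = -1.0
    return rewards


# Hypothetical episode: the expert takes over at steps 2-3 and again at step 5.
flags = [False, False, True, True, False, True]
rewards = label_intervention_rewards(flags)
```

Transitions relabeled this way could then be placed in a replay buffer and consumed by any off-policy RL algorithm; no task reward is needed, which is the property the abstract emphasizes.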

Paper Structure

This paper contains 46 sections, 9 theorems, 41 equations, 11 figures, 6 tables, 2 algorithms.

Key Result

Theorem 6.3

Let $\tilde{\pi}\in\Pi^\mathrm{opt}_{\delta}$ denote an optimal policy from maximizing the reward function $\tilde{r}_\delta$ generated by RLIF. Let ${{\epsilon} = \max\left\{ \mathbb E_{s\sim d_\mu^{\tilde{\pi}}}\ell(s,\pi(s)),\mathbb E_{s\sim d_\mu^{\tilde{\pi}}}\ell'(s,\pi(s)) \right\}}$ (Def. de

Figures (11)

  • Figure 1: RLIF uses RL to learn without ground truth rewards, with data collected with suboptimal human interventions.
  • Figure 2: A human operator supervises policy training and provides intervention with a 3D mouse.
  • Figure 3: Average success rate and intervention rate for the Adroit-Pen task during training. As the agent improves, the intervention rate decreases.
  • Figure 4: Tasks in our experimental evaluation: Benchmark tasks Walker2d, Pen, and Hopper and two vision-based contact-rich manipulation tasks on a real robot. The benchmark tasks require handling complex high-dimensional dynamics and underactuation. The robotic insertion task requires additionally addressing complex inputs such as images, non-differentiable dynamics such as contact, and all sensor noise associated with real-world robotic settings.
  • Figure 5: Sequential steps of robot manipulation for the peg insertion and cloth unfolding tasks on a real robot.
  • ...and 6 more figures

Theorems & Definitions (18)

  • Definition 6.2: Behavior Cloning Loss (Ross et al., 2011)
  • Theorem 6.3: Suboptimality Gap of RLIF
  • Corollary 6.4: Suboptimality Gap of DAgger
  • Theorem B.1: Suboptimality Gap of RLIF, Thm. 6.3 restated
  • proof
  • Corollary B.2: Suboptimality Gap of DAgger, Cor. 6.4 restated
  • proof
  • Example B.3: Lower Bounds of RLIF
  • proof
  • Definition C.1: Single-Policy Concentrability Coefficient
  • ...and 8 more