Fast Policy Learning through Imitation and Reinforcement

Ching-An Cheng, Xinyan Yan, Nolan Wagener, Byron Boots

TL;DR

The paper addresses two problems in policy learning, the data inefficiency of RL and the performance ceiling that a suboptimal expert imposes on IL, by formulating RL and IL algorithms in a common mirror-descent framework. It introduces LOKI, a two-phase algorithm that runs a short imitation phase of randomized length before switching to policy-gradient RL, which speeds up convergence and makes it possible to outperform a suboptimal expert. The authors prove that, when the switching time is properly randomized, LOKI behaves as if policy gradient had been run directly from the expert policy, and they support this with experiments on simulated locomotion, pendulum, and manipulation tasks. The result is a practical, theoretically grounded way to combine imitation and reinforcement learning.

Abstract

Imitation learning (IL) consists of a set of tools that leverage expert demonstrations to quickly learn policies. However, if the expert is suboptimal, IL can yield policies with inferior performance compared to reinforcement learning (RL). In this paper, we aim to provide an algorithm that combines the best aspects of RL and IL. We accomplish this by formulating several popular RL and IL algorithms in a common mirror descent framework, showing that these algorithms can be viewed as a variation on a single approach. We then propose LOKI, a strategy for policy learning that first performs a small but random number of IL iterations before switching to a policy gradient RL method. We show that if the switching time is properly randomized, LOKI can learn to outperform a suboptimal expert and converge faster than running policy gradient from scratch. Finally, we evaluate the performance of LOKI experimentally in several simulated environments.
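The two-phase structure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `two_phase_schedule`, the toy scalar "policy", the `il_step` and `rl_step` updates, and the uniform sampling of the switching time over `[k_min, k_max]` are all assumptions made for the example. The abstract only says the switching time must be "properly randomized"; the distribution used in the paper may differ.

```python
import random
from typing import Callable, List, Tuple


def two_phase_schedule(
    policy: float,
    n_iters: int,
    k_min: int,
    k_max: int,
    il_step: Callable[[float], float],
    rl_step: Callable[[float], float],
    rng: random.Random,
) -> Tuple[float, int, List[str]]:
    """Run a random number of IL updates, then switch to RL updates.

    The switching time k is drawn uniformly from [k_min, k_max]; uniform
    sampling is an assumption of this sketch, not a claim about the paper.
    """
    k = rng.randint(k_min, k_max)  # inclusive on both ends
    phases: List[str] = []
    for t in range(n_iters):
        if t < k:
            policy = il_step(policy)  # imitation phase
            phases.append("IL")
        else:
            policy = rl_step(policy)  # policy-gradient phase
            phases.append("RL")
    return policy, k, phases


# Toy problem (hypothetical): the "policy" is a scalar theta, the expert sits
# at theta = 1.0, and the true cost (theta - 2)^2 is minimized at theta = 2.0,
# so the expert is suboptimal.
EXPERT = 1.0
OPTIMUM = 2.0


def toy_il_step(theta: float) -> float:
    # Move halfway toward the expert.
    return theta + 0.5 * (EXPERT - theta)


def toy_rl_step(theta: float) -> float:
    # One gradient-descent step on the true cost (theta - OPTIMUM)^2.
    return theta - 0.1 * 2.0 * (theta - OPTIMUM)
```

In this toy setting the imitation phase quickly moves the parameter near the expert, and the subsequent gradient phase continues past it toward the true optimum, which mirrors the qualitative claim that the combined method can outperform a suboptimal expert.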

Paper Structure

This paper contains 37 sections, 9 theorems, 57 equations, 1 figure, 1 table, 1 algorithm.

Key Result

Lemma 1

(Kakade & Langford, 2002) Let $\pi$ and $\pi'$ be two policies and $A_{\pi'}(s, a) = Q_{\pi'}(s,a) - V_{\pi'}(s)$ be the (dis)advantage function with respect to running $\pi'$. Then it holds that
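This lemma is the performance difference lemma. A sketch of its usual form is given here under assumptions: $J(\pi)$ denotes the expected cost of running $\pi$ and $d_{\pi}$ the state distribution induced by $\pi$, neither of which is defined in this listing, and the identity holds up to the normalization constant implied by how $J$ and $d_{\pi}$ are defined.

$$
J(\pi) = J(\pi') + \mathbb{E}_{s \sim d_{\pi}} \, \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ A_{\pi'}(s, a) \right]
$$

Read this way, the performance gap between $\pi$ and $\pi'$ equals the expected (dis)advantage of $\pi$'s actions relative to $\pi'$, averaged over the states that $\pi$ itself visits.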

Figures (1)

  • Figure 1: Learning curves. Shaded regions correspond to $\pm \frac{1}{2}$-standard deviation.

Theorems & Definitions (11)

  • Lemma 1
  • Proposition 1
  • Proposition 2
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Lemma 2
  • proof
  • Lemma 3
  • Lemma 4
  • ...and 1 more