Self-Imitation Learning

Junhyuk Oh, Yijie Guo, Satinder Singh, Honglak Lee

TL;DR

Self-Imitation Learning (SIL) is a simple off-policy actor-critic mechanism that stores the agent's past high-return decisions and imitates them to drive deeper exploration. The authors justify it theoretically as a form of lower-bound soft Q-learning within entropy-regularized RL, and show that it can be combined with both A2C and PPO. Empirically, SIL improves A2C on hard-exploration Atari games, where it is competitive with count-based exploration methods, and improves PPO on MuJoCo tasks. The work shows that SIL applies broadly to actor-critic architectures and that exploiting past successes can support, rather than replace, ongoing exploration.

Abstract

This paper proposes Self-Imitation Learning (SIL), a simple off-policy actor-critic algorithm that learns to reproduce the agent's past good decisions. This algorithm is designed to verify our hypothesis that exploiting past good experiences can indirectly drive deep exploration. Our empirical results show that SIL significantly improves advantage actor-critic (A2C) on several hard-exploration Atari games and is competitive with state-of-the-art count-based exploration methods. We also show that SIL improves proximal policy optimization (PPO) on MuJoCo tasks.
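
As a rough illustration of what "reproducing past good decisions" can look like in an actor-critic setting, the sketch below computes a self-imitation loss on stored samples. It is a minimal sketch under stated assumptions, not the authors' implementation: it assumes each stored sample carries a discounted return R, and that imitation is applied only where R exceeds the current value estimate V(s), i.e. through the clipped gap max(R - V(s), 0). The function names, the `value_coef` weight, and its default are hypothetical choices made for this example.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = r_t + gamma * R_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def sil_loss(log_probs, values, returns, value_coef=0.01):
    """Self-imitation loss over a batch of stored (state, action, return) samples.

    log_probs: log pi(a|s) of the stored actions under the CURRENT policy
    values:    V(s) of the stored states under the CURRENT critic
    returns:   discounted returns recorded when the samples were collected
    value_coef is a hypothetical weighting, assumed for illustration.
    """
    # Only samples whose past return beats the current value estimate contribute.
    gaps = [max(r - v, 0.0) for r, v in zip(returns, values)]
    n = len(gaps)
    # Policy term: raise the log-probability of actions that did better than expected.
    # (In an autodiff framework the gap would be treated as a constant here.)
    policy_loss = sum(-lp * g for lp, g in zip(log_probs, gaps)) / n
    # Value term: pull V(s) up toward the better return, never down.
    value_loss = sum(0.5 * g * g for g in gaps) / n
    return policy_loss + value_coef * value_loss, gaps


if __name__ == "__main__":
    episode_returns = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
    loss, gaps = sil_loss([-1.0, -2.0], [0.5, 2.0], [1.0, 1.0])
    print(episode_returns, loss, gaps)
```

In this sketch the second sample has a return below its value estimate, so its gap is zero and it contributes nothing to either term. That is the intended behavior: only decisions that turned out better than the critic expected are imitated.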

Paper Structure

This paper contains 28 sections, 12 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Learning curves on Montezuma's Revenge. (Left) The agent needs to pick up the key in order to open the door. Picking up the key gives a small reward. (Right) The baseline (A2C) often picks up the key as shown by the best episode reward in 100K steps (A2C (Best)), but it fails to consistently reproduce such an experience. In contrast, self-imitation learning (A2C+SIL) quickly learns to pick up the key as soon as the agent experiences it, which leads to the next source of reward (door).
  • Figure 2: Key-Door-Treasure domain. The agent should pick up the key (K) in order to open the door (D) and collect the treasure (T) to maximize the reward. In the Apple-Key-Door-Treasure domain (bottom), there are two apples (A) that give small rewards (+1). 'SIL' and 'EXP' represent our self-imitation learning and a count-based exploration method respectively.
  • Figure 3: Learning curves on hard exploration Atari games. X-axis and y-axis represent steps and average reward respectively.
  • Figure 4: Relative performance of A2C+SIL over A2C.
  • Figure 5: Performance on OpenAI Gym MuJoCo tasks (top row) and delayed-reward versions of them (bottom row). The learning curves are averaged over 10 random seeds.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Claim