Self-Imitation Learning
Junhyuk Oh, Yijie Guo, Satinder Singh, Honglak Lee
TL;DR
Self-Imitation Learning (SIL) is a simple off-policy actor-critic mechanism that stores the agent's past high-return decisions and imitates them to drive deeper exploration. The authors justify it theoretically as lower-bound soft Q-learning within entropy-regularized RL, and show that SIL can be combined with both A2C and PPO across diverse tasks. Empirically, SIL improves performance on hard-exploration Atari games and MuJoCo tasks, sometimes outperforming or complementing count-based exploration methods. The work demonstrates that SIL applies generally to actor-critic architectures and can balance exploitation of past successes with ongoing exploration.
Abstract
This paper proposes Self-Imitation Learning (SIL), a simple off-policy actor-critic algorithm that learns to reproduce the agent's past good decisions. This algorithm is designed to verify our hypothesis that exploiting past good experiences can indirectly drive deep exploration. Our empirical results show that SIL significantly improves advantage actor-critic (A2C) on several hard-exploration Atari games and is competitive with state-of-the-art count-based exploration methods. We also show that SIL improves proximal policy optimization (PPO) on MuJoCo tasks.
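To make "reproducing past good decisions" concrete, the following is a minimal NumPy sketch of a self-imitation objective. It assumes that a stored transition counts as "good" when its observed return R exceeds the current value estimate V(s), so that only the positive part of R - V(s) contributes to the loss. The function name `sil_losses`, its arguments, and the default `value_coef` are illustrative assumptions, not the authors' code, and the sketch only evaluates the loss values (a real implementation would differentiate them, treating the clipped gap as a constant in the policy term).

```python
import numpy as np

def sil_losses(log_probs, values, returns, value_coef=0.01):
    """Evaluate a self-imitation loss on a batch of stored transitions.

    log_probs : log pi(a|s) of the stored actions under the current policy
    values    : current value estimates V(s) for the stored states
    returns   : discounted returns actually observed from those states
    value_coef: weight of the value term (assumed default, for illustration)
    """
    log_probs = np.asarray(log_probs, dtype=float)
    values = np.asarray(values, dtype=float)
    returns = np.asarray(returns, dtype=float)

    # Positive only where the past return beat the current value estimate;
    # transitions that did worse than expected contribute nothing.
    gap = np.maximum(returns - values, 0.0)

    # Raise the log-probability of past actions in proportion to the gap.
    policy_loss = np.mean(-log_probs * gap)
    # Pull the value estimate up toward the better-than-expected return.
    value_loss = np.mean(0.5 * gap ** 2)

    total = policy_loss + value_coef * value_loss
    return total, policy_loss, value_loss

# Two stored transitions: the first beat its value estimate, the second did not.
total, policy_loss, value_loss = sil_losses(
    log_probs=[-1.0, -2.0], values=[0.5, 3.0], returns=[2.5, 1.0]
)
print(total, policy_loss, value_loss)
```

In this toy batch only the first transition is imitated, because its return (2.5) exceeds its value estimate (0.5); the second is ignored because its return fell short of the estimate.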
