
LESSON: Learning to Integrate Exploration Strategies for Reinforcement Learning via an Option Framework

Woojun Kim, Jeonghye Kim, Youngchul Sung

Abstract

In this paper, a unified framework for exploration in reinforcement learning (RL) is proposed based on an option-critic model. The proposed framework learns to integrate a set of diverse exploration strategies so that the agent can adaptively select the most effective exploration strategy over time to realize a relevant exploration-exploitation trade-off for each given task. The effectiveness of the proposed exploration framework is demonstrated by various experiments in the MiniGrid and Atari environments.

Paper Structure (27 sections, 11 equations, 16 figures, 2 tables, 1 algorithm)

Figures (16)

  • Figure 1: Overall diagram of LESSON: The blue box shows the behavior policy realized by the proposed option model. The option selection policy $\pi_\Omega$ selects an intra-policy and the corresponding termination function. The target policy denoted by the red box is trained using the samples generated by the behavior policy.
  • Figure 2: Performance comparison on the MiniGrid tasks. More results are provided in Appendix \ref{sec:appx-experimental-results}.
  • Figure 3: Performance comparison on the Atari 2600 tasks
  • Figure 4: Comparison of LESSON with the baselines in the Empty-16x16 environment with the goal at the lower-right corner: (a) a view of the environment, (b) performance comparison, (c) the termination probabilities $\beta_\omega$ over time for LESSON, and (d) state visitation frequency. (Fig. 4(a) was obtained by rendering the MiniGrid Empty-16x16 environment while training gym_MiniGrid.)
  • Figure 5: Option selection policy and termination probability during training.
  • ...and 11 more figures
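
The behavior policy described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes standard call-and-return option execution: the active option keeps acting until its termination function $\beta_\omega$ fires, and then the option selection policy $\pi_\Omega$ picks a new one. The class name `OptionBehaviorPolicy`, the two placeholder strategies, the integer states, and all probabilities below are hypothetical choices for illustration, not the authors' implementation.

```python
import random
from typing import Callable, List, Optional, Sequence

State = int   # assumption: a state is an integer id in this toy example
Action = int
NUM_ACTIONS = 4  # assumption: small discrete action space


class OptionBehaviorPolicy:
    """Call-and-return execution over a set of exploration strategies."""

    def __init__(
        self,
        option_selector: Callable[[State], Sequence[float]],  # stands in for pi_Omega
        intra_policies: List[Callable[[State, random.Random], Action]],
        terminations: List[Callable[[State], float]],  # stands in for beta_omega
        seed: int = 0,
    ) -> None:
        if len(intra_policies) != len(terminations):
            raise ValueError("each intra-policy needs a termination function")
        self.option_selector = option_selector
        self.intra_policies = intra_policies
        self.terminations = terminations
        self.rng = random.Random(seed)
        self.current: Optional[int] = None  # index of the active option

    def act(self, state: State) -> Action:
        # Re-select when no option is active or the active one terminates here.
        if (
            self.current is None
            or self.rng.random() < self.terminations[self.current](state)
        ):
            probs = self.option_selector(state)
            self.current = self.rng.choices(
                range(len(probs)), weights=probs, k=1
            )[0]
        return self.intra_policies[self.current](state, self.rng)


def greedy(state: State, rng: random.Random) -> Action:
    # Placeholder for an exploitative strategy (hypothetical rule).
    return state % NUM_ACTIONS


def uniform_random(state: State, rng: random.Random) -> Action:
    # Placeholder for a purely exploratory strategy.
    return rng.randrange(NUM_ACTIONS)


policy = OptionBehaviorPolicy(
    option_selector=lambda s: [0.5, 0.5],
    intra_policies=[greedy, uniform_random],
    terminations=[lambda s: 0.1, lambda s: 0.5],
)
actions = [policy.act(s) for s in range(20)]
```

In this sketch the selector and termination functions are fixed lambdas. In a learned setting they would be trainable functions of the state, and the transitions generated by `act` would be stored to train a separate target policy off-policy, as the caption describes.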