Learning-Driven Exploration for Reinforcement Learning

Muhammad Usama, Dong Eui Chang

TL;DR

EBE introduces an entropy-based exploration mechanism that adapts exploration to the agent's learning progress by measuring state-specific action-value uncertainty. It defines a state-conditioned action distribution via a stabilized softmax over Q-values and uses the resulting entropy H(s) as the probability to explore in that state, replacing fixed ε in ε-greedy. The paper demonstrates faster and more data-efficient learning across a range of tasks (linear, Breakout, VizDoom, pendulum) and compares favorably to count-based methods, without tuning hyperparameters. It also provides thorough implementation details and releases code for reproducibility.
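
The selection rule summarized above can be sketched as follows. This is a minimal illustration, not the authors' released implementation. It assumes that the entropy is normalized by log |A| so that H(s) lies in [0, 1] and can be read as a probability, and that an exploratory step draws a uniformly random action as in ε-greedy. The function names `action_probabilities`, `normalized_entropy`, and `select_action` are hypothetical.

```python
import math
import random


def action_probabilities(q_values):
    """Softmax over Q-values, shifted by the max for numerical stability."""
    q_max = max(q_values)
    exps = [math.exp(q - q_max) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(q_values):
    """Entropy of the softmax distribution, scaled to [0, 1].

    Assumption: dividing by log(|A|) is what makes H(s) usable as a
    probability; the summary itself does not spell out the normalization.
    """
    n = len(q_values)
    if n < 2:
        return 0.0
    probs = action_probabilities(q_values)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Clamp to guard against floating-point drift just outside [0, 1].
    return min(1.0, max(0.0, h / math.log(n)))


def select_action(q_values, rng=random):
    """Explore with probability H(s); otherwise take the greedy action."""
    if rng.random() < normalized_entropy(q_values):
        # Assumption: exploratory actions are uniform, as in epsilon-greedy.
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

When all Q-values in a state are equal, the normalized entropy is at its maximum and the agent almost always explores. Once one action clearly dominates, the entropy approaches zero and the agent acts greedily, which is the adaptive behavior the summary describes.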

Abstract

Effective and intelligent exploration remains an unresolved problem in reinforcement learning. Most contemporary reinforcement learning relies on simple heuristic strategies such as $ε$-greedy exploration or adding Gaussian noise to actions. These heuristics, however, cannot distinguish well-explored regions of the state space from unexplored ones, which can lead to inefficient use of training time. We introduce entropy-based exploration (EBE), which enables an agent to efficiently explore the unexplored regions of the state space. EBE quantifies the agent's learning in a state using only state-dependent action values and explores the state space adaptively, i.e., it explores more in the unexplored regions. We perform experiments on a diverse set of environments and demonstrate that EBE enables efficient exploration that ultimately results in faster learning without having to tune any hyperparameter. The code to reproduce the experiments is given at \url{https://github.com/Usama1002/EBE-Exploration} and the supplementary video is given at \url{https://youtu.be/nJggIjjzKic}.

Paper Structure

This paper contains 28 sections, 8 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Conceptual visualization of entropy-based exploration (EBE).
  • Figure 2: Plot of (a) mean entropy $H_{o}$, given in equation \ref{eq:mean_entropy}, and (b) cumulative episode reward for trained, partially trained and untrained agents over 10 test episodes. The agents are trained to play the VizDoom game Seek and Destroy.
  • Figure 3: (a) A simple linear environment consisting of 21 states. The episode starts in state $s=10$, shown as a red circle. States $s=0$ and $s=20$, shown as green rounded rectangles, are terminal states. From a non-terminal state, the agent can transition into either of its neighboring states. The agent gets reward $r=1$ for transitioning into a terminal state and zero reward otherwise. (b) Squared-error loss for the value iteration task on the linear environment.
  • Figure 4: Plots show (a) test episode scores and (b) training episode scores for agents trained with EBE, $\epsilon$-greedy exploration and Boltzmann exploration on the Breakout game. Likewise, for the VizDoom game Seek and Destroy, (c) plots the mean score of 100 test episodes played after each training epoch and (d) plots the mean score of all training episodes played in a training epoch. Smoothed data is shown with solid lines while unsmoothed data is ghosted in the background. The smoothing method is adopted from TensorBoard with weight 0.99.
  • Figure 5: (a) plots the mean test reward of 100 test episodes played after each training epoch while (b) plots the mean training reward of all training episodes per epoch for the game DTC. (c) plots the mean test reward of 100 test episodes played after each training epoch while (d) plots the mean training reward of all training episodes per epoch for the game DTL. We compare EBE with $\epsilon$-greedy and Boltzmann exploration strategies. Plots show smoothed data while unsmoothed data is ghosted in the background. The smoothing method is adopted from TensorBoard with weight 0.975.
  • ...and 2 more figures
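
The linear environment described in the Figure 3 caption is small enough to sketch directly. The version below is an illustrative reconstruction from the caption only. It assumes deterministic transitions and exactly two actions (left and right), neither of which the caption states explicitly. The class name `LinearEnv` and its `reset`/`step` interface are hypothetical and are not taken from the released code.

```python
class LinearEnv:
    """Chain of 21 states: start at s=10, terminals at s=0 and s=20."""

    N_STATES = 21
    START = 10
    TERMINALS = (0, 20)
    LEFT, RIGHT = 0, 1  # assumption: two deterministic actions

    def __init__(self):
        self.state = self.START

    def reset(self):
        """Begin a new episode in the middle state."""
        self.state = self.START
        return self.state

    def step(self, action):
        """Move to a neighboring state; reward 1 only on reaching a terminal."""
        if self.state in self.TERMINALS:
            raise RuntimeError("episode finished; call reset()")
        self.state += 1 if action == self.RIGHT else -1
        done = self.state in self.TERMINALS
        reward = 1.0 if done else 0.0
        return self.state, reward, done
```

Because both terminal states give the same reward and every other transition gives zero, the only learning signal comes from reaching either end of the chain, which is what makes this a compact test bed for exploration.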