Value of Information-Enhanced Exploration in Bootstrapped DQN

Stergios Plataniotis, Charilaos Akasiadis, Georgios Chalkiadakis

TL;DR

Exploration in deep reinforcement learning remains challenging in sparse-reward environments. The authors introduce two EVOI-based extensions to Bootstrapped DQN, BootDQN-Gain and BootDQN-EVOI, which quantify information gain from head disagreements and integrate it into action selection. They demonstrate improved performance on several hard Atari games and increased diversity among ensemble heads, without adding extra hyperparameters. The approach offers a principled path to more efficient exploration by leveraging the information value of actions across an ensemble. Overall, the work suggests a viable direction for information-driven exploration in high-dimensional DRL.

Abstract

Efficient exploration in deep reinforcement learning remains a fundamental challenge, especially in environments characterized by high-dimensional states and sparse rewards. Traditional exploration strategies that rely on random local policy noise, such as $ε$-greedy and Boltzmann exploration methods, often struggle to efficiently balance exploration and exploitation. In this paper, we integrate the notion of (expected) value of information (EVOI) within the well-known Bootstrapped DQN algorithmic framework, to enhance the algorithm's deep exploration ability. Specifically, we develop two novel algorithms that incorporate the expected gain from learning the value of information into Bootstrapped DQN. Our methods use value of information estimates to measure the discrepancies of opinions among distinct network heads, and drive exploration towards areas with the most potential. We evaluate our algorithms with respect to performance and their ability to exploit inherent uncertainty arising from random network initialization. Our experiments in complex, sparse-reward Atari games demonstrate increased performance, all the while making better use of uncertainty, and, importantly, without introducing extra hyperparameters.
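To make the idea of scoring head disagreement concrete, here is a minimal sketch of value-of-information action selection over an ensemble. It is not the authors' exact formulation, which this summary does not reproduce. It assumes the classical value-of-perfect-information gain from Bayesian Q-learning, treats each head's Q-value as one sample of the true value, and uses the mean over heads as the expected Q-value. The function names `evoi_per_action` and `select_action` are hypothetical.

```python
import numpy as np


def evoi_per_action(q_heads):
    """Estimate a value-of-information bonus per action.

    q_heads: array of shape (K, A), one row of Q-values per ensemble head.
    Assumption: heads act as samples of the true Q-value, and the gain
    follows the classical value-of-perfect-information definition.
    Requires at least two actions.
    """
    q_heads = np.asarray(q_heads, dtype=float)
    mean_q = q_heads.mean(axis=0)
    order = np.argsort(mean_q)[::-1]
    best, second = order[0], order[1]

    gains = np.zeros_like(q_heads)
    # Currently best action: information is valuable only if a head
    # believes it is actually worse than the second-best action.
    gains[:, best] = np.maximum(mean_q[second] - q_heads[:, best], 0.0)
    # Any other action: information is valuable only if a head
    # believes it is actually better than the currently best action.
    others = np.arange(q_heads.shape[1]) != best
    gains[:, others] = np.maximum(q_heads[:, others] - mean_q[best], 0.0)

    # Average the per-head gains to obtain the expected gain per action.
    return gains.mean(axis=0)


def select_action(q_heads):
    """Pick the action maximizing mean Q-value plus the information bonus."""
    q_heads = np.asarray(q_heads, dtype=float)
    scores = q_heads.mean(axis=0) + evoi_per_action(q_heads)
    return int(np.argmax(scores))
```

For example, with two heads and two actions, `select_action(np.array([[2.0, 0.0], [2.0, 3.5]]))` returns action 1 even though action 0 has the higher mean Q-value. The heads disagree strongly about action 1, so its bonus of 0.75 outweighs the 0.25 gap in the means. Because the bonus is computed directly from the heads' outputs, this sketch needs no extra tunable coefficient.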

Paper Structure

This paper contains 12 sections, 10 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Example of the effect of different action-selection methods. Left: a specific game state in Pacman. Right: a bar plot that depicts, for each action, the BootDQN-computed $Q$-values (blue part of each bar), augmented with either $gain$ (left bars, $gain$ contribution in green) or EVOI (right bars, EVOI contribution in red). Horizontal dashed lines mark the maximum value for each method, in the corresponding color. In this state, BootDQN would choose action DOWN, BootDQN-Gain would choose LEFT, and BootDQN-EVOI would choose UP, so the three methods select three different actions from the same $Q$-values.
  • Figure 2: Evaluation rewards (subfigure (a)) and mean variance in voting (subfigure (b)) during evaluation periods, for BootDQN (our implementation), BootDQN-UCB, BootDQN-Gain, and BootDQN-EVOI. All plots show moving averages using a sliding window of the 10 most recent values, with shaded areas representing the 95% confidence intervals for each algorithm. In terms of performance, BootDQN-EVOI outperforms or matches its counterparts in all games. For clarity, mini-plots are included in cluttered plots in (a) to magnify results for the last 5M frames.
  • Figure 3: Mean evaluation reward of BootDQN-EVOI as training progresses, for different numbers of heads $K$ in Hero. Performance improves as $K$ increases.
  • Figure 4: Example instance of a DeepSea problem of size $N$ (an $N \times N$ grid). The agent starts each episode in the top-left cell; at each timestep it descends one row and moves either left or right. The goal is to reach the bottom-right cell and perform the action right. The agent receives a small negative reward for choosing right and a substantially positive reward for reaching the goal. The solid arrow represents the optimal policy ($\sum_t r_t = 0.99$) and the dashed arrow represents the second-best, suboptimal policy ($\sum_t r_t = 0$).
  • Figure 5: Average number of episodes needed, for each algorithm, to solve the corresponding DeepSea instance over $15$ seeds. Shaded areas visualize the 95% confidence intervals. Both of our methods outperform the baselines. BootDQN-EVOI in particular scales the best as the size of the problem (i.e., the $N$ of the $N \times N$ grid) increases.
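
The DeepSea dynamics described in the Figure 4 caption can be sketched as a small environment. This is an illustrative sketch, not the implementation used in the paper. The class name `DeepSea`, the `(row, col)` observation, and the action encoding (0 for left, 1 for right) are assumptions. The per-step cost of $0.01/N$ for moving right and the goal reward of $1$ are also assumptions, chosen to be consistent with the optimal return of $0.99$ stated in the caption.

```python
class DeepSea:
    """Minimal DeepSea sketch on an N x N grid (assumed reward scale)."""

    LEFT, RIGHT = 0, 1

    def __init__(self, size, move_cost=0.01):
        self.size = size
        # Assumed: each 'right' costs move_cost / N, so N rights cost 0.01.
        self.step_cost = move_cost / size
        self.reset()

    def reset(self):
        # Every episode starts in the top-left cell.
        self.row, self.col = 0, 0
        return (self.row, self.col)

    def step(self, action):
        reward = 0.0
        if action == self.RIGHT:
            reward -= self.step_cost
            # Goal: choosing 'right' while in the bottom-right cell.
            if self.row == self.size - 1 and self.col == self.size - 1:
                reward += 1.0
            self.col = min(self.col + 1, self.size - 1)
        else:
            self.col = max(self.col - 1, 0)
        # The agent descends one row at every timestep.
        self.row += 1
        done = self.row == self.size
        return (self.row, self.col), reward, done


def run_fixed_policy(env, action):
    """Return the total reward of an episode that repeats a single action."""
    env.reset()
    total, done = 0.0, False
    while not done:
        _, reward, done = env.step(action)
        total += reward
    return total
```

Under these assumptions, always moving right yields a return of about $0.99$ and always moving left yields $0$, matching the two policies drawn in Figure 4. Any episode that moves right at least once without reaching the goal ends with a negative return, which is why undirected exploration struggles as $N$ grows.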