Deterministic Exploration via Stationary Bellman Error Maximization

Sebastian Griesbach, Carlo D'Eramo

TL;DR

The separate exploration agent is informed about the state of the exploitation, which lets it account for previous experiences. Further components make the exploration objective agnostic toward the episode length and mitigate instability introduced by far-off-policy learning.

Abstract

Exploration is a crucial and distinctive aspect of reinforcement learning (RL) that remains a fundamental open problem. Several methods have been proposed to tackle this challenge. Commonly used methods inject random noise directly into the actions, indirectly via entropy maximization, or add intrinsic rewards that encourage the agent to steer to novel regions of the state space. Another previously seen idea is to use the Bellman error as a separate optimization objective for exploration. In this paper, we introduce three modifications to stabilize the latter and arrive at a deterministic exploration policy. Our separate exploration agent is informed about the state of the exploitation, thus enabling it to account for previous experiences. Further components are introduced to make the exploration objective agnostic toward the episode length and to mitigate instability introduced by far-off-policy learning. Our experimental results show that our approach can outperform $\varepsilon$-greedy in dense and sparse reward settings.
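
To make the quantities named in the abstract concrete, the following minimal sketch computes a generic one-step Bellman error for a tabular Q-function and picks a deterministic action by taking the argmax over a second table of exploration values. The names `bellman_error`, `greedy_action`, `q`, and the toy numbers are hypothetical and chosen only for illustration; this is not the paper's stabilized objective, which adds three modifications not reproduced here.

```python
def bellman_error(q, state, action, reward, next_state, done, gamma=0.99):
    """One-step Bellman (TD) error of a tabular Q-function for one transition."""
    # No bootstrapping from the next state once the episode has terminated.
    bootstrap = 0.0 if done else max(q[next_state])
    target = reward + gamma * bootstrap
    return target - q[state][action]


def greedy_action(q_explore, state):
    """Deterministic action choice: argmax of the exploration values.

    Ties are broken in favour of the lowest action index.
    """
    values = q_explore[state]
    return max(range(len(values)), key=lambda a: values[a])


# Hypothetical toy problem with 2 states and 2 actions.
q = [[0.0, 1.0], [0.5, 2.0]]
delta = bellman_error(q, state=0, action=1, reward=1.0,
                      next_state=1, done=False, gamma=0.9)

# Assumption for illustration: the magnitude of the Bellman error acts as
# the reward signal that an exploration agent would try to maximize.
exploration_reward = abs(delta)
```

In contrast to $\varepsilon$-greedy, `greedy_action` involves no random number generator: given the same exploration values and state, it always returns the same action.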

Paper Structure

This paper contains 20 sections, 7 equations, 2 figures, 5 tables, 1 algorithm.

Figures (2)

  • Figure 1: Comparison between $\varepsilon$-greedy and SEE. The x-axis is in steps of $10,000$. The line shows the mean performance across $50$ runs, the shaded area shows the standard error of the mean.
  • Figure 2: Comparison of SEE and modified versions where one of the components is replaced or left out. The x-axis is in steps of $10,000$. For this ablation, only the PredictableLunarLander-v0 environment has been used. The line shows the mean performance across $10$ runs, the shaded area shows the standard error of the mean.
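
The line and shaded band described in both captions can be computed per time step as sketched below. This is a minimal sketch that assumes each run is stored as an equal-length list of returns and that the standard error uses the sample standard deviation (divisor $n-1$); the captions do not state which estimator was used, and the names `mean_and_sem` and `curve_with_band` are hypothetical.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for one time step."""
    n = len(values)
    mean = statistics.fmean(values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


def curve_with_band(runs):
    """Aggregate equal-length return curves, one per run, step by step."""
    return [mean_and_sem(step_values) for step_values in zip(*runs)]


# Hypothetical data: 3 runs, 2 evaluation points each.
runs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
band = curve_with_band(runs)
```

Each entry of `band` is a `(mean, sem)` pair; plotting `mean` as the line and `mean ± sem` as the shaded area gives a curve of the kind the captions describe.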