Dealing with uncertainty: balancing exploration and exploitation in deep recurrent reinforcement learning

Valentina Zangirolami, Matteo Borrotti

TL;DR

Adaptive exploration methods better approximate the trade-off between exploration and exploitation, and Softmax and Max-Boltzmann strategies generally outperform epsilon-greedy techniques.

Abstract

Incomplete knowledge of the environment forces an agent to make decisions under uncertainty. One of the major dilemmas in Reinforcement Learning (RL) is the exploration-exploitation trade-off: an autonomous agent must balance exploiting its current knowledge of the environment to maximize the cumulative reward against exploring actions that improve that knowledge and may lead to higher rewards. A second issue is the full observability of the states, which cannot be assumed in all applications, for instance when 2D images are the input to an RL approach that must find the best actions within a 3D simulation environment. In this work, we address these issues by deploying and testing several techniques for balancing exploration and exploitation on partially observable systems, applied to predicting steering wheel angles in autonomous driving scenarios. More precisely, the final aim is to investigate the effects of using both adaptive and deterministic exploration strategies coupled with a Deep Recurrent Q-Network. Additionally, we adapted a modified quadratic loss function and evaluated its impact on the learning phase of the underlying Convolutional Recurrent Neural Network. We show that adaptive methods better approximate the trade-off between exploration and exploitation and that, in general, Softmax and Max-Boltzmann strategies outperform epsilon-greedy techniques.
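To make the three families of strategies named in the abstract concrete, here is a minimal Python sketch of action selection under epsilon-greedy, Softmax (Boltzmann), and Max-Boltzmann exploration. It follows the textbook definitions and is not the authors' implementation: the function names, the `temperature` parameter, and the fixed `epsilon` are illustrative assumptions, and the adaptive schedules for these parameters that the paper studies are not reproduced here.

```python
import math
import random


def greedy_action(q_values):
    """Index of the action with the highest estimated Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax_probs(q_values, temperature=1.0):
    """Boltzmann distribution over actions (max subtracted for stability)."""
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


def softmax_action(q_values, temperature=1.0, rng=random):
    """Sample an action in proportion to its Boltzmann probability."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def max_boltzmann(q_values, epsilon, temperature=1.0, rng=random):
    """With probability epsilon sample from the Boltzmann distribution,
    otherwise act greedily (assumed standard Max-Boltzmann rule)."""
    if rng.random() < epsilon:
        return softmax_action(q_values, temperature, rng)
    return greedy_action(q_values)


if __name__ == "__main__":
    q = [0.1, 0.5, 0.2]            # hypothetical Q-values for three actions
    rng = random.Random(0)         # seeded generator for reproducibility
    print(epsilon_greedy(q, epsilon=0.1, rng=rng))
    print(softmax_action(q, temperature=0.5, rng=rng))
    print(max_boltzmann(q, epsilon=0.1, temperature=0.5, rng=rng))
```

The sketch shows the key difference: epsilon-greedy explores uniformly at random, whereas Softmax and Max-Boltzmann weight exploratory actions by their estimated Q-values, so clearly poor actions are tried less often.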

Paper Structure (25 sections, 22 equations, 7 figures, 6 tables, 1 algorithm)

Figures (7)

  • Figure 1: Convolutional Neural Network with an LSTM layer. Each pre-processed input image was processed by three convolutional layers, and the resulting activations were passed through an LSTM layer. Q-values were estimated by feeding the LSTM output into separate advantage and value fully-connected layers and combining their outputs.
  • Figure 2: Pre-processing of the car's front-camera images. The two-step pre-processing procedure is illustrated. First, each input image is cropped at the top and bottom to focus the view on the road. The images are then downsampled before being fed into the neural network.
  • Figure 3: AirSim NH environment. The Neighborhood Environment is shown through the AirSim simulation platform.
  • Figure 4: Comparison of the exploration strategies with D3RQN agent. The training curves show the average reward per 100 episodes for each value of the buffer size. The horizontal axis and the vertical axis indicate, respectively, the number of episodes and the average reward.
  • Figure 5: Comparison of $\epsilon$ values for $\epsilon$-greedy and Max-Boltzmann methods. The horizontal axis and the vertical axis indicate, respectively, the number of environment steps and the $\epsilon$ value. Panel (a) shows $\epsilon$ values per step for deterministic $\epsilon$-greedy strategies. Panel (b) shows $\epsilon$ values per step for adaptive $\epsilon$-greedy strategies (BMC and VDBE), distinguished by buffer size. Panel (c) shows $\epsilon$ values per step for Max-Boltzmann Exploration methods (constant and VDBE), distinguished by buffer size.
  • ...and 2 more figures
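
The captions of Figures 1 and 2 describe two steps that can be sketched in a few lines of Python: combining value and advantage outputs into Q-values, and cropping then downsampling an image. Both functions rest on assumptions, since the captions do not give the details. The aggregation uses the mean-subtracted form common in dueling architectures, Q(s, a) = V(s) + A(s, a) - mean(A), which may differ from the authors' choice. The crop sizes and the stride-based subsampling are hypothetical stand-ins for the actual pre-processing parameters.

```python
def dueling_q_values(value, advantages):
    """Combine a scalar state value with per-action advantages.

    Assumed aggregation: Q = V + A - mean(A), the usual dueling form.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def crop_and_downsample(image, top, bottom, stride):
    """Crop `top` and `bottom` rows, then keep every `stride`-th pixel.

    `image` is a list of rows (lists of pixel values). Plain subsampling is
    used as a simple, assumed form of downsampling.
    """
    cropped = image[top:len(image) - bottom]
    return [row[::stride] for row in cropped[::stride]]


if __name__ == "__main__":
    # Hypothetical value and advantage outputs for three steering actions.
    print(dueling_q_values(1.0, [0.0, 1.0, 2.0]))

    # Hypothetical 8x6 single-channel image with illustrative pixel values.
    image = [[r * 10 + c for c in range(6)] for r in range(8)]
    print(crop_and_downsample(image, top=2, bottom=2, stride=2))
```

Subtracting the mean advantage keeps the value and advantage outputs identifiable, which is the usual motivation for this form of aggregation.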