Dealing with uncertainty: balancing exploration and exploitation in deep recurrent reinforcement learning

Valentina Zangirolami; Matteo Borrotti

TL;DR

Adaptive methods better approximate the trade-off between exploration and exploitation, and Softmax and Max-Boltzmann strategies generally outperform epsilon-greedy techniques.

Abstract

Incomplete knowledge of the environment leads an agent to make decisions under uncertainty. One of the major dilemmas in Reinforcement Learning (RL) is the exploration-exploitation trade-off: an autonomous agent has to balance exploiting its current knowledge of the environment to maximize the cumulative reward against exploring actions that improve that knowledge and may lead to higher reward values. Another relevant issue is the full observability of the states, which cannot be assumed in all applications. This is the case, for instance, when 2D images are the input to an RL approach used for finding the best actions within a 3D simulation environment. In this work, we address these issues by deploying and testing several techniques for balancing the exploration-exploitation trade-off on partially observable systems for predicting steering-wheel actions in autonomous driving scenarios. More precisely, the final aim is to investigate the effects of using both adaptive and deterministic exploration strategies coupled with a Deep Recurrent Q-Network. Additionally, we adapted a modified quadratic loss function, intended to improve the learning phase of the underlying Convolutional Recurrent Neural Network, and evaluated its impact. We show that adaptive methods better approximate the trade-off between exploration and exploitation and that, in general, Softmax and Max-Boltzmann strategies outperform epsilon-greedy techniques.
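
The three families of exploration strategies named in the abstract differ only in how an action is drawn from the current Q-value estimates. The following is a minimal sketch of their textbook forms, not the paper's implementation: the function names, the `temperature` parameter, and the example Q-values are assumptions made for illustration, and the adaptive variants (which adjust epsilon during training) are not shown.

```python
import math
import random

def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over Q-values; subtracting the max keeps exp() numerically stable."""
    scaled = [q / temperature for q in q_values]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def greedy(q_values):
    """Index of the action with the highest estimated Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy(q_values)

def softmax_action(q_values, temperature, rng):
    """Sample an action from the Boltzmann distribution over Q-values."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

def max_boltzmann(q_values, epsilon, temperature, rng):
    """With probability epsilon sample from the Boltzmann distribution, else act greedily."""
    if rng.random() < epsilon:
        return softmax_action(q_values, temperature, rng)
    return greedy(q_values)

# Hypothetical Q-values for three discrete actions.
rng = random.Random(0)
q = [0.1, 0.5, 0.2]
action = max_boltzmann(q, epsilon=0.1, temperature=1.0, rng=rng)
```

In this sketch, Max-Boltzmann differs from epsilon-greedy only in the exploratory branch: exploration is weighted by the Q-values rather than uniform, so clearly poor actions are tried less often.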

Paper Structure

This paper contains 25 sections, 22 equations, 7 figures, 6 tables, and 1 algorithm.

Introduction
Related literature
Exploration strategies
Function gradient-based RL
Autonomous driving
Background and preliminaries
Multi-Armed Bandits and Contextual Bandits
Reinforcement Learning
Exploration strategies
Recurrent Reinforcement Learning
Deep Recurrent Q-Learning for autonomous driving
Observation space
Action space and Reward
Agent's architecture
Exploration strategies for recurrent learning
...and 10 more sections

Figures (7)

Figure 1: Convolutional Neural Network with LSTM layer. Each pre-processed input image was processed by three convolutional layers, and the resulting activations were passed through an LSTM layer. Q-values were estimated by feeding the LSTM output into separate advantage and value fully connected layers and combining their outputs.
Figure 2: Pre-processing of the car's front camera images. The two-step pre-processing procedure is illustrated. First, each input image is cropped at the top and bottom to focus the view on the road. The images are then downsampled before being fed into the neural network.
Figure 3: AirSim NH environment. The Neighborhood Environment is shown through the AirSim simulation platform.
Figure 4: Comparison of the exploration strategies with D3RQN agent. The training curves show the average reward per 100 episodes for each value of the buffer size. The horizontal axis and the vertical axis indicate, respectively, the number of episodes and the average reward.
Figure 5: Comparison of $\epsilon$ values for $\epsilon$-greedy and Max-Boltzmann methods. The horizontal axis and the vertical axis indicate, respectively, the number of environment steps and the $\epsilon$ value. Panel (a) shows $\epsilon$ values per step for deterministic $\epsilon$-greedy strategies. Panel (b) shows $\epsilon$ values per step for adaptive $\epsilon$-greedy strategies (BMC and VDBE), distinguished by buffer size. Panel (c) shows $\epsilon$ values per step for Max-Boltzmann Exploration methods (constant and VDBE), distinguished by buffer size.
...and 2 more figures
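
The Figure 1 caption says that the advantage and value outputs are combined into Q-values but does not state how. A minimal sketch, assuming the mean-subtracted aggregation commonly used in dueling architectures (the function name and the example numbers are hypothetical):

```python
def dueling_q_values(value, advantages):
    """Combine a scalar state value with per-action advantages.

    Assumed form: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a').
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [value + a - mean_advantage for a in advantages]

# Hypothetical outputs of the value and advantage layers for three actions.
q_values = dueling_q_values(1.0, [0.0, 1.0, 2.0])
```

Subtracting the mean advantage keeps the value and advantage terms identifiable; without it, a constant could be shifted between them without changing the Q-values.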
