Dynamic Deep-Reinforcement-Learning Algorithm in Partially Observable Markov Decision Processes

Saki Omi; Hyo-Sang Shin; Namhoon Cho; Antonios Tsourdos

TL;DR

This study examines how effective it is to supply the action history alongside the observation history to a recurrent network, and how the network architecture that processes them affects performance. It introduces three novel approaches with different architectures and interprets how LSTM networks summarize the trajectories.

Abstract

Recent studies have greatly improved reinforcement learning, and interest in real-world implementation has grown. In many cases, implementation is challenged by time-varying disturbances, which introduce hidden states, so the problem is best described as a Partially Observable Markov Decision Process (POMDP). An effective approach to this problem is to introduce a Recurrent Neural Network (RNN) in place of a state estimator. However, only a few studies have investigated which types of information should be supplied to the RNN and which network architecture should handle them. This study discusses the effectiveness of including actions along with observations, and the impact of the network architecture that handles them, by interpreting how LSTM networks summarize the trajectories. Specifically, three novel approaches with different architectures are introduced. In simulation environments, all three algorithms demonstrated the effectiveness of including the action trajectories. In particular, one of the developed algorithms, H-TD3, differs from the typical actor-critic arrangement in that the critic network is trained using the hidden states generated by the actor network as the summarized trajectory information. This novel approach showed potential to reduce computational time while maintaining performance.
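The idea of sharing the actor's trajectory summary with the critic can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the dimensions, the random weights, the single linear actor and critic heads, and all names (`LSTMCell`, `summarize_history`, `W_actor`, `W_critic`) are assumptions made for illustration, and training, twin critics, and target networks are omitted. The sketch only shows an LSTM folding an observation-and-action history into a hidden state that both heads then consume.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class LSTMCell:
    """Single LSTM cell with random weights (hypothetical, illustration only)."""

    def __init__(self, input_dim, hidden_dim, rng):
        # One stacked matrix holds the input, forget, candidate and output gates.
        self.W = rng.normal(0.0, 0.1, size=(4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.hidden_dim = hidden_dim

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_new = sigmoid(o) * np.tanh(c_new)
        return h_new, c_new


def summarize_history(cell, observations, prev_actions):
    """Fold an (observation, previous action) trajectory into one hidden state."""
    h = np.zeros(cell.hidden_dim)
    c = np.zeros(cell.hidden_dim)
    for obs, act in zip(observations, prev_actions):
        # The action is concatenated with the observation at every time step.
        h, c = cell.step(np.concatenate([obs, act]), h, c)
    return h


rng = np.random.default_rng(0)
obs_dim, act_dim, hidden_dim, horizon = 3, 2, 8, 5  # assumed sizes

actor_lstm = LSTMCell(obs_dim + act_dim, hidden_dim, rng)
W_actor = rng.normal(0.0, 0.1, size=(act_dim, hidden_dim))
W_critic = rng.normal(0.0, 0.1, size=(1, hidden_dim + act_dim))

# Placeholder trajectory; a real agent would read this from a replay buffer.
observations = rng.normal(size=(horizon, obs_dim))
prev_actions = rng.normal(size=(horizon, act_dim))

# The actor's LSTM produces the summary h_t of the trajectory.
h_t = summarize_history(actor_lstm, observations, prev_actions)
action = np.tanh(W_actor @ h_t)

# The critic reuses h_t instead of running a recurrent network of its own.
q_value = (W_critic @ np.concatenate([h_t, action])).item()
```

Because the critic head takes `h_t` as input, the history is processed by only one recurrent network per step in this sketch, which is one plausible reading of where a computational saving could come from.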

Paper Structure

This paper contains 33 sections, 1 equation, 14 figures, 4 tables.

Introduction
Related Work
RL in Dynamic Environments
Action Sequence Inclusion
History Length
Network Architecture
Computational Efficiency
Preliminaries
Dynamic RL Adapting to Environment with Disturbance
Causality and Statistics
Analysis of Disturbance
POMDP Formulation
Information States, Belief States in POMDP
Identification of Transition Model
LSTM to Reflect the Sequence in the Internal Representation $s^{*}_t$
...and 18 more sections

Figures (14)

Figure 1: LSTM-TD3 schematics
Figure 2: Causal diagram. (a) With dynamic disturbance. (b) With non-dynamic disturbance
Figure 3: Results for Fully Observable MDP Case
Figure 4: Results for Scenario 1 - Temporal Bias
Figure 5: Results for Scenario 2 - Temporal Sinusoidal Wave
...and 9 more figures
