Decision Theory-Guided Deep Reinforcement Learning for Fast Learning

Zelin Wan; Jin-Hee Cho; Mu Zhu; Ahmed H. Anwar; Charles Kamhoua; Munindar P. Singh

TL;DR

The paper tackles the cold-start challenge in deep reinforcement learning by introducing Decision Theory-guided DRL (DT-guided DRL), which injects problem-specific utility-based guidance into the learning process. It formulates DRL tasks as MDPs and uses PPO, while DT provides a structured utility function for the cart pole and maze environments; these utilities are converted to action probabilities and merged with the neural policy through a decaying weight on the DT component, with a final softmax temperature of 1. Key contributions include a novel framework for fusing decision theory with DRL, empirical benchmarks on cart pole and maze showing substantially higher early rewards (up to 184%) and improved convergence, and an analysis of how DT-guided exploration stabilizes learning in large state spaces. The findings demonstrate that incorporating designer knowledge via utility functions can safely improve initial performance and exploration efficiency, while preserving the ability to learn from interaction data, thereby enabling more robust and practical deployments of DRL in complex domains. This work lays groundwork for further interdisciplinary research at the intersection of decision theory and deep reinforcement learning, with potential impact on safety-critical and real-world decision tasks.
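The merging step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `dt_weight`, `guided_action_probs`), the exponential decay schedule, the `decay_rate` value, and the choice to add the weighted DT probabilities to the network's logits are all hypothetical. Only the decaying DT weight and the final softmax with temperature 1 come from the summary.

```python
import math


def softmax(values, temperature=1.0):
    """Turn a list of scores into a probability distribution."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dt_weight(episode, initial_weight=1.0, decay_rate=0.05):
    """Hypothetical schedule: the DT weight decays exponentially per episode."""
    return initial_weight * math.exp(-decay_rate * episode)


def guided_action_probs(utilities, policy_logits, episode):
    """Merge DT guidance with the neural policy's output.

    Assumption: utilities become probabilities via a softmax, are scaled by
    the decaying weight, added to the network logits, and passed through a
    final softmax with temperature 1.
    """
    dt_probs = softmax(utilities)
    weight = dt_weight(episode)
    merged = [logit + weight * p for logit, p in zip(policy_logits, dt_probs)]
    return softmax(merged, temperature=1.0)


if __name__ == "__main__":
    utilities = [2.0, 0.0]      # designer-supplied utility per action
    policy_logits = [0.0, 0.0]  # untrained network: no preference yet
    for episode in (0, 50, 1000):
        print(episode, guided_action_probs(utilities, policy_logits, episode))
```

Early in training the DT term dominates an uninformed network, so the agent favors the higher-utility action; as the weight decays, the output approaches the network's own policy.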

Abstract

This paper introduces a novel approach, Decision Theory-guided Deep Reinforcement Learning (DT-guided DRL), to address the inherent cold start problem in DRL. By integrating decision theory principles, DT-guided DRL enhances agents' initial performance and robustness in complex environments, enabling more efficient and reliable convergence during learning. Our investigation encompasses two primary problem contexts: the cart pole and maze navigation challenges. Experimental results demonstrate that the integration of decision theory not only facilitates effective initial guidance for DRL agents but also promotes a more structured and informed exploration strategy, particularly in environments characterized by large and intricate state spaces. The experiments also show that DT-guided DRL yields significantly higher rewards than regular DRL. Specifically, during the initial phase of training, DT-guided DRL yields up to a 184% increase in accumulated reward. Moreover, even after reaching convergence, it maintains superior performance, ending with up to 53% more reward than standard DRL in large maze problems. DT-guided DRL represents an advancement in mitigating a fundamental challenge of DRL by leveraging functions informed by human (designer) knowledge, setting a foundation for further research in this promising interdisciplinary domain.

Paper Structure (17 sections, 3 equations, 5 figures, 2 tables)

Introduction
Motivation
Key Objectives
Related Work
Platforms for DRL's Performance Analysis
Mitigating Cold Start Problems in DRL
Decision Theory-Guided DRL
Problem Formulation Using DRL
Problem Formulation Using Decision Theory
Cart Pole Environment
Maze Environment
Integrating DT with DRL
Experimental Setup
Simulation Results & Analysis
Performance Analysis of the Cart Pole Problem
...and 2 more sections

Figures (5)

Figure 1: The procedure by which a DT-guided DRL agent generates solutions: $\mathcal{S}_t$ is the state at round $t$ and $\pi(a_t|\mathcal{S}_t)$ is the probability of all actions.
Figure 2: Comparison of training performance in the cart pole problem under DT, PPO, DT-guided PPO, SE PPO, and IL PPO using 500 training episodes in accumulated reward.
Figure 3: Comparison of training performance under different maze sizes ($m$) under DT, PPO, DT-guided PPO, SE PPO, and TL PPO using 500 training episodes in accumulated reward.
Figure 4: Comparison of accumulated rewards under varying maze size ($m$), averaged over 500 episodes, under DT, PPO, DT-guided PPO, SE PPO, and TL PPO.
Figure 5: Comparison of running time per step under varying maze size ($m$), averaged over 500 episodes, under DT, PPO, DT-guided PPO, SE PPO, and TL PPO.
