Counteractive RL: Rethinking Core Principles for Efficient and Scalable Deep Reinforcement Learning

Ezgi Korkmaz

Abstract

Following the pivotal success of learning strategies to win at tasks solely by interacting with an environment, without any supervision, agents have gained the ability to make sequential decisions in complex MDPs. Yet reinforcement learning policies face exponentially growing state spaces in high-dimensional MDPs, resulting in a dichotomy between computational complexity and policy success. In this paper we focus on the agent's interaction with the environment in a high-dimensional MDP during the learning phase, and we introduce a theoretically founded novel paradigm based on experiences obtained through counteractive actions. Our analysis and method provide a theoretical basis for efficient, effective, scalable and accelerated learning, and the method adds zero additional computational complexity while significantly accelerating training. We conduct extensive experiments in the Arcade Learning Environment with high-dimensional state representation MDPs. The experimental results further verify our theoretical analysis, and our method achieves a significant performance increase with substantial sample efficiency in high-dimensional environments.

Paper Structure (8 sections, 3 theorems, 14 equations, 6 figures, 1 table, 1 algorithm)

Introduction
Contributions.
Background and Preliminaries
Maximizing Temporal Difference with Counteractive Actions
Motivating Example
Experimental Results
Investigating the Temporal Difference
Conclusion

Key Result

Theorem 3.4

Let $\eta, \delta > 0$ and suppose that $Q_\theta(s,a)$ is $\eta$-uninformed and $\delta$-smooth. Let $s_t \in \mathcal{S}$ be a state, and let $a^{\textrm{min}}$ be the action minimizing the state-action value in a given state $s_t$, $a^{\textrm{min}} = \mathop{\mathrm{arg\,min}}\limits_a Q_\theta(

Figures (6)

Figure 1: Human normalized scores median and 80$^{\textrm{th}}$ percentile over all games in the Arcade Learning Environment (ALE) 100K benchmark for CoAct TD Learning and the canonical temporal difference learning with $\epsilon$-greedy for QRDQN. Left: Median. Right: 80$^{\textrm{th}}$ Percentile.
Figure 2: Learning curves in the chain MDP with our proposed algorithm CoAct TD Learning, the canonical algorithm $\epsilon$-greedy and the UCB algorithm with variations in $\epsilon$.
Figure 3: Temporal difference for our proposed algorithm CoAct TD Learning and the canonical $\epsilon$-greedy algorithm in the Arcade Learning Environment 100K benchmark. Dashed lines report the temporal difference for the $\epsilon$-greedy algorithm and solid lines report the temporal difference for the CoAct TD Learning algorithm. Colors indicate games.
Figure 4: The learning curves for our proposed method CoAct TD Learning algorithm and canonical temporal difference learning in the Arcade Learning Environment with 200 million frame training. Left: JamesBond. MiddleLeft: Gravitar. MiddleRight: Surround. Right: Tennis.
Figure 5: Human normalized scores median and 80$^{\textrm{th}}$ percentile over all games in the Arcade Learning Environment (ALE) 100K benchmark in DDQN for the CoAct TD Learning algorithm and the canonical temporal difference learning with $\epsilon$-greedy. Left: Median. Right: 80$^{\textrm{th}}$ Percentile.
...and 1 more figure

Theorems & Definitions (9)

Definition 3.1: $\eta$-uninformed
Definition 3.2: $\delta$-smooth
Definition 3.3: Disadvantage Gap
Theorem 3.4: Counteractive Actions Increase Temporal Difference
proof
Definition 3.5: $\delta$-smoothness for Double-$Q$
Theorem 3.6
proof
Proposition 3.7: Marginal and Conditional Distribution of Counteractive Actions
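
To make the idea behind Theorem 3.4 concrete, here is a minimal sketch, assuming that the counteractive action $a^{\textrm{min}}$ (the action minimizing the state-action value) takes the place of the uniformly random exploratory action in $\epsilon$-greedy. The function names `select_action` and `td_error`, the one-step TD target with a max over next-state values, and the toy numbers are illustrative assumptions, not the paper's implementation.

```python
import random


def select_action(q_values, epsilon, rng, counteractive=True):
    """Pick an action index from a list of Q-values for one state.

    With probability epsilon the agent explores. In the counteractive
    variant (an assumption of this sketch) exploration means taking the
    argmin of Q; otherwise it falls back to a uniformly random action,
    as in standard epsilon-greedy. With probability 1 - epsilon the
    agent takes the greedy (argmax) action.
    """
    indices = range(len(q_values))
    if rng.random() < epsilon:
        if counteractive:
            return min(indices, key=q_values.__getitem__)
        return rng.randrange(len(q_values))
    return max(indices, key=q_values.__getitem__)


def td_error(reward, gamma, q_next_values, q_current):
    """One-step temporal difference: r + gamma * max_a' Q(s', a') - Q(s, a)."""
    return reward + gamma * max(q_next_values) - q_current


if __name__ == "__main__":
    rng = random.Random(0)
    q_s = [1.0, -2.0, 3.0]  # hypothetical Q-values for a single state
    action = select_action(q_s, epsilon=1.0, rng=rng)
    print("counteractive action:", action)
    print("TD error:", td_error(1.0, 0.5, [2.0, 4.0], q_s[action]))
```

Note that the counteractive branch needs only an argmin over Q-values the agent already computes for the greedy choice, which is consistent with the abstract's claim of zero additional computational complexity.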
