Counteractive RL: Rethinking Core Principles for Efficient and Scalable Deep Reinforcement Learning

Ezgi Korkmaz

Abstract

Following the pivotal success of learning strategies to win at tasks solely by interacting with an environment, without any supervision, agents have gained the ability to make sequential decisions in complex MDPs. Yet reinforcement learning policies face exponentially growing state spaces in high-dimensional MDPs, resulting in a dichotomy between computational complexity and policy success. In this paper we focus on the agent's interaction with the environment in a high-dimensional MDP during the learning phase, and we introduce a theoretically founded novel paradigm based on experiences obtained through counteractive actions. Our analysis and method provide a theoretical basis for efficient, effective, scalable, and accelerated learning, and come with zero additional computational complexity while leading to significant acceleration in training. We conduct extensive experiments in the Arcade Learning Environment with MDPs that have high-dimensional state representations. The experimental results further verify our theoretical analysis, and our method achieves a significant performance increase with substantial sample efficiency in high-dimensional environments.

Paper Structure

This paper contains 8 sections, 3 theorems, 14 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Theorem 3.4

Let $\eta, \delta > 0$ and suppose that $Q_\theta(s,a)$ is $\eta$-uninformed and $\delta$-smooth. Let $s_t \in \mathcal{S}$ be a state, and let $a^{\textrm{min}}$ be the action minimizing the state-action value in a given state $s_t$, $a^{\textrm{min}} = \mathop{\mathrm{arg\,min}}\limits_a Q_\theta(

Figures (6)

  • Figure 1: Human normalized scores median and 80$^{\textrm{th}}$ percentile over all games in the Arcade Learning Environment (ALE) 100K benchmark for CoAct TD Learning and the canonical temporal difference learning with $\epsilon$-greedy for QRDQN. Left: Median. Right: 80$^{\textrm{th}}$ Percentile.
  • Figure 2: Learning curves in the chain MDP with our proposed algorithm CoAct TD Learning, the canonical algorithm $\epsilon$-greedy and the UCB algorithm with variations in $\epsilon$.
  • Figure 3: Temporal difference for our proposed algorithm CoAct TD Learning and the canonical $\epsilon$-greedy algorithm in the Arcade Learning Environment 100K benchmark. Dashed lines report the temporal difference for the $\epsilon$-greedy algorithm and solid lines report the temporal difference for the CoAct TD Learning algorithm. Colors indicate games.
  • Figure 4: The learning curves for our proposed method, the CoAct TD Learning algorithm, and canonical temporal difference learning in the Arcade Learning Environment with 200 million frame training. Left: JamesBond. Middle left: Gravitar. Middle right: Surround. Right: Tennis.
  • Figure 5: Human normalized scores median and 80$^{\textrm{th}}$ percentile over all games in the Arcade Learning Environment (ALE) 100K benchmark in DDQN for the CoAct TD Learning algorithm and the canonical temporal difference learning with $\epsilon$-greedy. Left: Median. Right: 80$^{\textrm{th}}$ Percentile.
  • ...and 1 more figure

Theorems & Definitions (9)

  • Definition 3.1: $\eta$-uninformed
  • Definition 3.2: $\delta$-smooth
  • Definition 3.3: Disadvantage Gap
  • Theorem 3.4: Counteractive Actions Increase Temporal Difference
  • proof
  • Definition 3.5: $\delta$-smoothness for Double-$Q$
  • Theorem 3.6
  • proof
  • Proposition 3.7: Marginal and Conditional Distribution of Counteractive Actions
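
Theorem 3.4 defines the counteractive action as $a^{\textrm{min}}$, the action that minimizes $Q_\theta$ in the current state, and the figure captions compare CoAct TD Learning against canonical $\epsilon$-greedy. The sketch below illustrates one way the two action-selection rules could differ. It assumes that the counteractive action is taken with probability $\epsilon$ in place of the uniformly random action; this excerpt does not state that rule, and `select_action` and its parameters are hypothetical names used only for illustration.

```python
import random


def select_action(q_values, epsilon, rng, counteractive=True):
    """Pick an action index from a list of Q-value estimates for one state.

    Assumption (not stated in this excerpt): with probability epsilon the
    counteractive rule takes argmin_a Q(s, a), whereas canonical
    epsilon-greedy takes a uniformly random action. Otherwise both rules
    act greedily with argmax_a Q(s, a).
    """
    actions = range(len(q_values))
    if rng.random() < epsilon:
        if counteractive:
            # Counteractive action: the action with the lowest estimated value.
            return min(actions, key=q_values.__getitem__)
        # Canonical epsilon-greedy exploration: uniform random action.
        return rng.randrange(len(q_values))
    # Greedy action: the action with the highest estimated value.
    return max(actions, key=q_values.__getitem__)


rng = random.Random(0)
q = [0.5, -1.0, 2.0]
a_min = select_action(q, epsilon=1.0, rng=rng)     # always the counteractive branch
a_greedy = select_action(q, epsilon=0.0, rng=rng)  # always the greedy branch
```

Under this assumption the only change from $\epsilon$-greedy is the choice made in the exploration branch, with one extra `min` over values that are already computed for the greedy choice.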