ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages

Andrew Jesson, Chris Lu, Gunshi Gupta, Nicolas Beltran-Velez, Angelos Filos, Jakob Nicolaus Foerster, Yarin Gal

TL;DR

Under standard assumptions, we prove that restricting policy updates to positive advantages optimizes for value by maximizing a lower bound on the value function plus an additive term. The additive term is bounded proportional to the Lipschitz constant of the value function, which offers theoretical grounding for spectral normalization of critic weights.

Abstract

This paper proposes a step toward approximate Bayesian inference in on-policy actor-critic deep reinforcement learning. It is implemented through three changes to the Asynchronous Advantage Actor-Critic (A3C) algorithm: (1) applying a ReLU function to advantage estimates, (2) spectral normalization of actor-critic weights, and (3) incorporating dropout as a Bayesian approximation. We prove under standard assumptions that restricting policy updates to positive advantages optimizes for value by maximizing a lower bound on the value function plus an additive term. We show that the additive term is bounded proportional to the Lipschitz constant of the value function, which offers theoretical grounding for spectral normalization of critic weights. Finally, our application of dropout corresponds to approximate Bayesian inference over both the actor and critic parameters, which enables adaptive state-aware exploration around the modes of the actor via Thompson sampling. We demonstrate significant improvements for median and interquartile mean metrics over A3C, PPO, SAC, and TD3 on the MuJoCo continuous control benchmark and improvement over PPO in the challenging ProcGen generalization benchmark.
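The first change, applying a ReLU to advantage estimates, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `positive_advantage_loss`, its arguments, and the example batch are hypothetical, and plain Python lists stand in for the tensors an autodiff framework would use. The sketch only shows how transitions with non-positive advantage estimates drop out of the policy objective.

```python
import math


def relu(x: float) -> float:
    """Return max(0, x)."""
    return x if x > 0.0 else 0.0


def positive_advantage_loss(log_probs, returns, values):
    """Policy loss that keeps only positive advantage estimates.

    log_probs: log pi(a_t | s_t) for the sampled actions (hypothetical inputs)
    returns:   return estimates used as targets
    values:    the critic's state-value estimates
    """
    if not log_probs:
        raise ValueError("empty batch")
    if not (len(log_probs) == len(returns) == len(values)):
        raise ValueError("inputs must have equal length")
    total = 0.0
    for logp, g, v in zip(log_probs, returns, values):
        advantage = g - v  # advantage estimate for this transition
        total += relu(advantage) * logp  # non-positive advantages contribute 0
    # Negate so that minimizing the loss performs ascent on the objective.
    return -total / len(log_probs)


# Hypothetical batch of two transitions with advantages +1.0 and -2.0.
log_probs = [math.log(0.5), math.log(0.25)]
returns = [2.0, 1.0]
values = [1.0, 3.0]
print(positive_advantage_loss(log_probs, returns, values))
```

In this batch only the first transition influences the loss, because the second has a negative advantage estimate and is zeroed by the ReLU.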

Paper Structure (23 sections, 6 theorems, 43 equations, 13 figures, 9 tables)

Key Result

Theorem 3.1

Let $\mathrm{G}_{\mathrm{t}} \coloneqq \sum_{\mathrm{k} = \mathrm{t} + 1}^{\mathrm{T}} \gamma^{\mathrm{k} - 1 - \mathrm{t}} \mathrm{R}_{\mathrm{k}}$ denote the discounted return. Let $q_{\pi}(\mathbf{s}, \mathbf{a}) = \mathbb{E}_{\pi} \left[ \mathrm{G}_{\mathrm{t}} \mid \mathbf{S}_{\mathrm{t}} = \mathbf{s}, \mathbf{A}_{\mathrm{t}} = \mathbf{a} \right]$ [...] maximizes a lower bound, $v_{\pi}^*(\mathbf{s})$, on the state value function, $v_{\pi}(\mathbf{s})$ [...]
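As a small worked example of the discounted return defined in Theorem 3.1, the sketch below evaluates the sum directly. The function name `discounted_return`, the list layout (index `k - 1` holds $\mathrm{R}_{\mathrm{k}}$), and the reward values are assumptions made for illustration only.

```python
def discounted_return(rewards, gamma, t=0):
    """Compute G_t = sum_{k=t+1}^{T} gamma^(k-1-t) * R_k.

    rewards[k - 1] is assumed to hold R_k for k = 1..T, so T = len(rewards).
    """
    T = len(rewards)
    if not 0 <= t <= T:
        raise ValueError("t must satisfy 0 <= t <= T")
    return sum(gamma ** (k - 1 - t) * rewards[k - 1] for k in range(t + 1, T + 1))


# Hypothetical three-step episode with R_1 = 1, R_2 = 2, R_3 = 3.
rewards = [1.0, 2.0, 3.0]
g0 = discounted_return(rewards, gamma=0.5, t=0)  # 1 + 0.5 * 2 + 0.25 * 3
g1 = discounted_return(rewards, gamma=0.5, t=1)  # 2 + 0.5 * 3
print(g0, g1)
```

The two values also satisfy the usual recursion $\mathrm{G}_{\mathrm{t}} = \mathrm{R}_{\mathrm{t}+1} + \gamma \mathrm{G}_{\mathrm{t}+1}$, which follows directly from the definition.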

Figures (13)

  • Figure 1: MuJoCo. Ablating the effect of the proposed mechanisms. Here, we compare VSOP to VSOP without spectral normalization (no-spectral), VSOP without Thompson sampling (no-Thompson), VSOP without advantage clipping (no-ReLU Adv.), and VSOP using all-actions policy optimization (all actions). We see that no single mechanism contributes more than the sum of all changes, lending credence to the validity of our theory. The overall performance (a-b) and sample efficiency (c-d) metrics illustrate this result. Metrics are computed with respect to the average episodic return of the last 100 episodes and the area under the episodic return curve over ten random seeds.
  • Figure 2: Comparing the effect of VSOP mechanisms on MuJoCo continuous control performance. Using the single-action framework and updating the policy only on positive advantage estimates have the largest effects, followed by spectral normalization, and finally Thompson sampling. Blue lines (VSOP) show the optimized proposed method. Orange lines (no-Thompson) show VSOP without Thompson sampling. Green lines (no-Spectral) show VSOP without spectral normalization. Pink lines (all actions) show VSOP with "all actions". Red lines (no ReLU Adv.) show VSOP without restricting policy updates to positive advantages.
  • Figure 3: MuJoCo. Comparison to baselines. We see that VSOP (blue) shows significant improvement over each baseline on the median and IQM metrics. VSOP trails SAC and TD3 only on the mean metric. Metrics are computed with respect to the average episodic return of the last 100 episodes over 10 random seeds.
  • Figure 4: MuJoCo. Comparison to on-policy baselines with extreme parallelization. We compare VSOP to on-policy baselines on MuJoCo with 2048 threads and 10 steps per rollout. Metrics are computed with respect to the average episodic return of the last 100 episodes over 20 random seeds.
  • Figure 5: MuJoCo: effect of parallelization on VSOP. Naming convention: #threads/#steps/spectral norm. We see that VSOP is most effective and most efficient in lower thread settings for a fixed rollout size of 2048 steps when using spectral normalization. Metrics are computed with respect to the average episodic return or area under the curve for the last 100 episodes over 5 random seeds.
  • ...and 8 more figures
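
The captions above report median, interquartile mean (IQM), and mean metrics. As a rough illustration of why these aggregates can disagree, the sketch below computes all three on made-up scores. It assumes the common trimmed-mean definition of IQM (drop the lowest and highest quarter of the sorted scores, then average the rest); the exact aggregation protocol behind the figures is not given here, and the scores are hypothetical, not results from the paper.

```python
from statistics import mean, median


def interquartile_mean(scores):
    """Average the middle portion of the sorted scores.

    Assumption: floor(n / 4) scores are discarded from each end, which is
    one simple way to realize a 25% trimmed mean.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)
    cut = len(ordered) // 4
    middle = ordered[cut:len(ordered) - cut]
    return mean(middle)


# Hypothetical normalized scores with one large outlier.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0]
print("median:", median(scores))
print("IQM:   ", interquartile_mean(scores))
print("mean:  ", mean(scores))
```

The single outlier pulls the mean far above the median and IQM, which is one reason a method can lead on median and IQM while trailing on the mean.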

Theorems & Definitions (11)

  • Theorem 3.1
  • Theorem 2.1
  • proof
  • Lemma 2.1
  • proof
  • Lemma 2.2
  • proof
  • Lemma 2.3
  • proof
  • Lemma 2.4
  • ...and 1 more