Maximum Entropy On-Policy Actor-Critic via Entropy Advantage Estimation

Jean Seong Bjorn Choe; Jong-Kook Kim

TL;DR

This paper proposes a simple method of separating the entropy objective from the MaxEnt RL objective, which facilitates the implementation of MaxEnt RL in on-policy settings and highlights MaxEnt RL's capacity to enhance generalisation.

Abstract

Entropy regularisation is a widely adopted technique that enhances policy optimisation performance and stability. A notable form of entropy regularisation augments the objective with an entropy term, thereby simultaneously optimising the expected return and the entropy. This framework, known as maximum entropy reinforcement learning (MaxEnt RL), has shown theoretical and empirical successes. However, its practical application in straightforward on-policy actor-critic settings remains surprisingly underexplored. We hypothesise that this is due to the difficulty of managing the entropy reward in practice. This paper proposes a simple method of separating the entropy objective from the MaxEnt RL objective, which facilitates the implementation of MaxEnt RL in on-policy settings. Our empirical evaluations demonstrate that extending Proximal Policy Optimisation (PPO) and Trust Region Policy Optimisation (TRPO) within the MaxEnt framework improves policy optimisation performance in both MuJoCo and Procgen tasks. Additionally, our results highlight MaxEnt RL's capacity to enhance generalisation.
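The separation described in the abstract can be illustrated with a minimal sketch. It assumes, without confirmation from this page, that the entropy signal at each step is the negative log-probability of the action taken, that the task and entropy advantages are each estimated with GAE using their own discount and $\lambda$ (the $\gamma_V$, $\gamma_\mathcal{H}$ and $\lambda_\mathcal{H}$ named in the figure captions), and that the two are combined with the temperature $\tau$ as a weight. The function names and the default `lam_v=0.95` are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def gae(rewards: Sequence[float], values: Sequence[float],
        gamma: float, lam: float) -> List[float]:
    """Generalised advantage estimates for one finished episode.

    `values` holds one entry per step plus a final bootstrap value,
    so len(values) == len(rewards) + 1.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the estimate of the next one.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def combined_advantages(rewards: Sequence[float],
                        log_probs: Sequence[float],
                        task_values: Sequence[float],
                        entropy_values: Sequence[float],
                        tau: float,
                        gamma_v: float = 0.99, lam_v: float = 0.95,
                        gamma_h: float = 0.8, lam_h: float = 0.95) -> List[float]:
    """Task advantage plus tau times a separately estimated entropy advantage."""
    # Assumption: the per-step entropy signal is -log pi(a_t | s_t).
    entropy_rewards = [-lp for lp in log_probs]
    adv_v = gae(rewards, task_values, gamma_v, lam_v)
    adv_h = gae(entropy_rewards, entropy_values, gamma_h, lam_h)
    return [a_v + tau * a_h for a_v, a_h in zip(adv_v, adv_h)]
```

Setting `tau=0.0` recovers the plain task advantage, while keeping two value estimates lets the entropy term use a shorter horizon than the task reward.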

Paper Structure (22 sections, 1 theorem, 15 equations, 7 figures, 3 tables)

Introduction
Background
Preliminaries
Soft advantage function
Soft policy gradient theorem
Related works
Proposed method
Overview
Entropy advantage estimation
Entropy critic
Entropy advantage policy optimisation
Experiments
Procgen benchmark environments
Discretised continuous control tasks
MiniGrid-DoorKey-8x8 environment
...and 7 more sections

Key Result

Theorem 1

Let $J(\pi)$ be the MaxEnt RL objective defined in eq:soft_obj, and let $\pi_\theta(a|s)$ be a parameterised policy. Then,

Figures (7)

Figure 1: The normalised state visitation counts from 100 rollouts with policies trained on the modified MiniGrid-Empty-8x8 task using a naive MaxEnt algorithm (PPO with the augmented entropy reward) and EAPO using 2 different discount factors $\gamma_\mathcal{H} \in \{0.9, 0.99\}$ and TD$(0)$ for entropy estimation. We compare 3 different temperatures $\tau\in\{0.002, 0.003, 0.004\}$. A discount factor $\gamma_V=0.99$ is used for the task reward. $L$ is the mean length of trajectories, with agents aiming to minimise it (10 is optimal). See the appendix (app:example-empty) for more details.
Figure 2: Left: Generalisation test results of EAPO agents with $\gamma_\mathcal{H}=0.8$, $\lambda_\mathcal{H}=0.95$, and two different temperatures $\tau=0.02$ and $\tau=0.005$ against PPO agents with entropy coefficients of $0.001$ and $0.01$ on 16 Procgen (Cobbe et al., 2020) benchmark environments. Agents are evaluated on 100 levels unseen during training. EAPO consistently outperforms or at least matches PPO in all environments. Results are averaged over 10 seeds, and the shaded area indicates the 95% confidence interval. Right: The mean normalised score for both test and training, computed according to Cobbe et al. (2020).
Figure 3: Top: Mean episodic trajectory entropy of EAPO ($\gamma_\mathcal{H}=0.8$, $\lambda_\mathcal{H}=0.95$) and PPO with entropy coefficients $c\in\{0.01, 0.001\}$, on a subset of Procgen environments during testing. The trajectory entropy of an episode is calculated as the sum of the negative log probabilities of the actions taken in the episode. Bottom: Mean episodic return of the selected environments during testing and training. The higher-entropy policy ($\tau=0.02$) outperforms the lower-entropy policy ($\tau=0.005$) during testing while achieving matching performance during training (Dodgeball, Leaper) and exhibits a smaller generalisation gap (Dodgeball, Chaser, Leaper).
Figure 4: Performance comparison on 4 MuJoCo tasks. We measured the mean episodic return of the stochastic policy periodically over 100 episodes during training. Results are averaged over 10 random seeds, and the shaded area indicates the 95% confidence interval. Top: EAPO-PPO. We compare EAPO to the PPO agent with the best-performing entropy coefficient, and to the entropy-reward-augmented PPO. Bottom: EAPO-TRPO. We also compare with the entropy-reward-augmented TRPO.
Figure 5: Comparison of return and trajectory entropy for EAPO with $\tau=10^{-5}$, $\gamma_\mathcal{H}=0.8$ and $\lambda_\mathcal{H}=0$ and PPO with entropy coefficients $0.01$ and $0.001$. Results are averaged over 10 random seeds, and the shaded area indicates the 95% confidence interval.
...and 2 more figures
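
The trajectory entropy reported in the Figure 3 caption (the sum of the negative log probabilities of the actions taken in an episode) can be computed as in the following sketch. The function names and the input layout, one list of chosen-action probabilities per episode, are assumptions made for illustration and are not taken from the paper's code.

```python
import math
from typing import Sequence


def trajectory_entropy(action_probs: Sequence[float]) -> float:
    """Sum of -log pi(a_t | s_t) over the actions taken in one episode."""
    return -sum(math.log(p) for p in action_probs)


def mean_trajectory_entropy(episodes: Sequence[Sequence[float]]) -> float:
    """Average the per-episode trajectory entropy over several episodes."""
    return sum(trajectory_entropy(ep) for ep in episodes) / len(episodes)
```

A deterministic policy that always assigns probability 1 to the action it takes yields a trajectory entropy of zero, so larger values indicate more stochastic behaviour over the episode.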

Theorems & Definitions (1)

Theorem 1: Soft Policy Gradient
